Resources: DGT-ACQUIS, New freely available large-scale aligned parallel corpus

Sat Dec 1 19:34:34 UTC 2012

Date: Wed, 28 Nov 2012 10:26:38 +0100
From: Ralf Steinberger <ralf.steinberger at jrc.ec.europa.eu>
Message-id: <000d01cdcd4a$7852d1e0$68f875a0$@jrc.ec.europa.eu>
X-url: http://langtech.jrc.ec.europa.eu/DGT-Acquis.html

Following the release of the JRC-Acquis in 2006, the DGT-Translation
Memory in several releases since 2007 and the ECDC-Translation Memory in
2012, we are now releasing the new parallel corpus
DGT-Acquis. DGT-Acquis has been produced by the European Commission’s
Directorate General for Translation (DGT) and it is being distributed by
the Joint Research Centre (JRC).

DGT-Acquis is a parallel collection of manually translated full-text
documents in all 23 official EU languages that has been
paragraph-aligned for all 253 language pairs. It has been produced on
the basis of the Official Journal (OJ) of the European Union (more
specifically the L, LM, C, CA and CE Series).

Languages: All 253 language pairs involving the following 23 languages:

Bulgarian, Czech, Danish, Dutch, English, Estonian, German, Greek,
Finnish, French, Irish, Hungarian, Italian, Latvian, Lithuanian,
Maltese, Polish, Portuguese, Romanian, Slovak, Slovene, Spanish and
Swedish.
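
The figure of 253 language pairs follows from counting the unordered
pairs that can be formed from 23 languages: 23 x 22 / 2 = 253. A minimal
Python sketch of that count (the variable names are illustrative and not
part of the release):

```python
from itertools import combinations

# The 23 official EU languages listed in the announcement.
LANGUAGES = [
    "Bulgarian", "Czech", "Danish", "Dutch", "English", "Estonian",
    "German", "Greek", "Finnish", "French", "Irish", "Hungarian",
    "Italian", "Latvian", "Lithuanian", "Maltese", "Polish",
    "Portuguese", "Romanian", "Slovak", "Slovene", "Spanish", "Swedish",
]

# Unordered pairs: (English, French) and (French, English) count once.
language_pairs = list(combinations(LANGUAGES, 2))

print(len(LANGUAGES))       # 23
print(len(language_pairs))  # 253
```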

URL: http://langtech.jrc.ec.europa.eu/DGT-Acquis.html

Creator: European Commission - Directorate General for Translation (DGT),
http://ec.europa.eu/dgs/translation/index_en.htm

Size: 3.54 million files; 5 GB in plain text format

WHAT IS DGT-ACQUIS

DGT-Acquis consists of a collection of Official Journal issues published
in up to 23 languages between 2004 and 2011. The full-text documents
have been paragraph-aligned automatically for all language pairs. The
data is being distributed in several formats: (1) the original XML data
and its corresponding TIFF files; (2) file level data in Formex4 format;
(3) file level data in plain text format; and (4) the same data aligned
at paragraph level. Users can thus make use of the aligned data or they
can re-process the data using their own tools and methods.
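
For users who want to work with the paragraph-aligned data, the basic
idea can be sketched in Python as follows. The announcement does not
describe the layout of the aligned files, so this sketch assumes a
hypothetical representation: each document is a list of paragraphs, and
an alignment is a list of (source index, target index) pairs. The
function name pair_paragraphs and the sample texts are likewise invented
for illustration and are not taken from the corpus.

```python
from typing import List, Tuple


def pair_paragraphs(
    source: List[str],
    target: List[str],
    alignment: List[Tuple[int, int]],
) -> List[Tuple[str, str]]:
    """Return (source paragraph, target paragraph) pairs.

    Assumption: `alignment` holds 0-based index pairs; links pointing
    outside either document are skipped instead of raising an error.
    """
    pairs = []
    for src_idx, tgt_idx in alignment:
        if 0 <= src_idx < len(source) and 0 <= tgt_idx < len(target):
            pairs.append((source[src_idx], target[tgt_idx]))
    return pairs


# Invented sample data, not taken from the corpus.
english = ["Article 1", "This Regulation shall enter into force."]
french = ["Article premier", "Le présent règlement entre en vigueur."]
links = [(0, 0), (1, 1), (2, 2)]  # the last link is out of range

for en, fr in pair_paragraphs(english, french, links):
    print(en, "|", fr)
```

The actual aligned files may encode links differently, for example as
many-to-many paragraph groups, in which case the index pairs would have
to be replaced by the structure found in the release.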

WHAT IS THE DIFFERENCE BETWEEN DGT-ACQUIS AND THE OTHER RESOURCES
DISTRIBUTED BY THE JRC

While the translation memories DGT-TM and ECDC-TM are collections of
individual translation units (or sentences) taken out of their full-text
context, both JRC-Acquis and DGT-Acquis consist of full-text documents
aligned at sentence or paragraph level. This allows using the data for
applications that need to analyse entire texts, e.g. for discourse
structure analysis, to detect domain information, for experiments on
automatic summarisation, for translation studies, etc.

Regarding the contents of the documents, JRC-Acquis and DGT-Acquis
partially overlap for the period 2004 to 2006 while the documents for
all other time periods should be unique. Comparing the resources used to
produce DGT-Acquis and DGT-TM, DGT-TM is based exclusively on the
L-Series of the Official Journal, while DGT-Acquis also contains the LM,
C, CA and CE collections.

The processing steps (data preparation and alignment) to produce the
various data sets were entirely different. The format is not the same,
and the processing quality of each of the resources is expected to be
different as well. For details on the resources and on the overlap
between them, see the detailed descriptions of the resources at
http://ipsc.jrc.ec.europa.eu/index.php?id=61.

MOTIVATION FOR THIS RELEASE

The public data release is in line with the general effort of the
European Commission to support multilingualism, language diversity and
the re-use of Commission information. It follows the release of the
JRC-Acquis parallel corpus in 2006 (over 1 billion words in 22
languages), of the DGT-TM Translation Memory since 2007, the
multilingual named entity resource JRC-Names in 2011, the multilingual
multi-label classification tool (and accompanying text data) JRC EuroVoc
Indexer (JEX) (22 languages), and further smaller multilingual
resources. See http://ipsc.jrc.ec.europa.eu/index.php?id=61 for more
information on these resources.

WHAT DGT-ACQUIS CAN BE USED FOR

DGT-ACQUIS is a large parallel corpus in electronic form. It can be used
by specialists in computational linguistics to train statistical machine
translation software, to generate multilingual dictionaries, to train
and test multilingual information extraction software, to carry out
testing and training of summarisation or discourse analysis software, to
train and test cross-lingual clustering and classification, and
more. Parallel corpora are also particularly useful for annotation
projection across languages, which saves annotation effort and thus
facilitates the development of highly multilingual text processing
software. See
http://publications.jrc.ec.europa.eu/repository/handle/111111111/1/simple-search?query=%28%28author%3ASteinberger%29+AND+%28title%3AAnnotation+title%3AParallel%29%29&from_advanced=true&conjunction3=AND&field4=type&conjunction2=AND&field3=ANY&field2=title&conjunction1=AND&query4=&field1=author&query1=Steinberger&query2=Annotation+Parallel&query3=&num_search_field=4

MORE INFORMATION ON DGT-ACQUIS

Detailed publications on the JRC’s multilingual language technology
activity are available at
http://langtech.jrc.ec.europa.eu/JRC_Publications.html. No detailed
publication on DGT-Acquis itself is available yet. Until further notice,
please refer to it by citing the web page
http://langtech.jrc.ec.europa.eu/DGT-Acquis.html.

WHAT NEXT?

The JRC and collaborating European Union services are currently
finalising the release of further highly multilingual linguistic
resources.

Ralf Steinberger http://langtech.jrc.ec.europa.eu/RS.html
European Commission - Joint Research Centre (JRC) 
21027 Ispra (VA), Italy

URL – Applications: http://emm.newsbrief.eu/overview.html

URL – Resources: http://ipsc.jrc.ec.europa.eu/index.php?id=61

-------------------------------------------------------------------------
Message distributed by the Langage Naturel list <LN at cines.fr>
Information, subscription : http://www.atala.org/article.php3?id_article=48
English version       : 
Archives                 : http://listserv.linguistlist.org/archives/ln.html
                                http://liste.cines.fr/info/ln

The LN list is sponsored by ATALA (Association pour le Traitement
Automatique des Langues)
Information and membership  : http://www.atala.org/
-------------------------------------------------------------------------