18.1699, FYI: Freely Available JRC-Acquis Parallel Corpus

LINGUIST Network linguist at LINGUISTLIST.ORG
Mon Jun 4 18:57:27 UTC 2007


LINGUIST List: Vol-18-1699. Mon Jun 04 2007. ISSN: 1068 - 4875.

Subject: 18.1699, FYI: Freely Available JRC-Acquis Parallel Corpus

Moderators: Anthony Aristar, Eastern Michigan U <aristar at linguistlist.org>
            Helen Aristar-Dry, Eastern Michigan U <hdry at linguistlist.org>
 
Reviews: Laura Welcher, Rosetta Project  
       <reviews at linguistlist.org> 

Homepage: http://linguistlist.org/

The LINGUIST List is funded by Eastern Michigan University, 
and donations from subscribers and publishers.

Editor for this issue: Dan Parker <dan at linguistlist.org>
================================================================  

To post to LINGUIST, use our convenient web form at
http://linguistlist.org/LL/posttolinguist.html.

===========================Directory==============================  

1)
Date: 01-Jun-2007
From: Ralf Steinberger < Ralf.Steinberger at jrc.it >
Subject: Freely Available JRC-Acquis Parallel Corpus

 

	
-------------------------Message 1 ---------------------------------- 
Date: Mon, 04 Jun 2007 14:54:33
From: Ralf Steinberger < Ralf.Steinberger at jrc.it >
Subject: Freely Available JRC-Acquis Parallel Corpus 
 

We are pleased to announce a new release of the freely available
multilingual parallel corpus JRC-Acquis (version 3.0). The corpus size has
nearly tripled (totalling over 1 billion words) and Bulgarian texts have now
been added (thanks to the Romanian Academy of Sciences), so that the
parallel texts are now available in 22 languages. 

Size and Format:

- 22 languages (all official EU languages except Irish)
- Average corpus size per language: 28.9 million words, plus 19 million
words in annexes, etc.
- 23,000 texts per language (fewer in Bulgarian, Maltese and Romanian)
- XML Format according to TEI P4, UTF-8-encoded
- Modular: download the languages you need.
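
As a rough illustration of working with the format described above, the
following Python sketch reads a small TEI P4-style document from a
UTF-8-encoded byte string and collects its paragraphs. The sample markup,
the "n" attribute and the function name extract_paragraphs are assumptions
made for illustration only; the actual element layout of the corpus files
should be checked against the downloaded data.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample in the spirit of TEI P4 (root element TEI.2).
# The real corpus files may use different attributes and nesting.
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<TEI.2 id="sample-en" lang="en">
  <teiHeader>
    <fileDesc><titleStmt><title>Sample act</title></titleStmt></fileDesc>
  </teiHeader>
  <text>
    <body>
      <p n="1">First paragraph.</p>
      <p n="2">Second paragraph, with a non-ASCII word: café.</p>
    </body>
  </text>
</TEI.2>
""".encode("utf-8")


def extract_paragraphs(xml_bytes):
    """Return (n, text) tuples for every <p> element found under <text>."""
    root = ET.fromstring(xml_bytes)  # bytes input honours the declared encoding
    text = root.find("text")
    if text is None:
        return []
    return [(p.get("n"), "".join(p.itertext()).strip()) for p in text.iter("p")]


for number, paragraph in extract_paragraphs(SAMPLE):
    print(number, paragraph)
```

Passing bytes rather than an already decoded string lets the parser apply
the encoding named in the XML declaration.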

Languages:

Bulgarian, Czech, Danish, Dutch, English, Estonian, German, Greek, Finnish,
French, Hungarian, Italian, Latvian, Lithuanian, Maltese, Polish,
Portuguese, Romanian, Slovak, Slovene, Spanish, Swedish.

Text Types:

- Documents on the contents, principles and political objectives of the EU
Treaties
- EU legislation
- Declarations
- Resolutions
- Acts
- International agreements.

Paragraph Alignment:

Paragraph alignment for all 231 language pairs will soon be available for
version 3.0 of the corpus. The following text applies to version 2.2, still
available on the same website:

- Paragraph-aligned for all 210 language pairs
- Paragraphs are sentence parts, sentences, or groups of sentences
- 2 alternative alignments: using Vanilla and HunAlign
- Ca. 270,000 alignments per language pair.
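
The pair counts quoted above follow from choosing 2 languages out of n,
i.e. n(n-1)/2: the 22 languages of version 3.0 give 231 pairs, and 210
pairs correspond to 21 languages. A minimal Python sketch of this
arithmetic follows; the helper name language_pairs is illustrative, and
the figure of 21 languages for version 2.2 is inferred from the 210 pairs
rather than stated in this announcement.

```python
from itertools import combinations
from math import comb

# The 22 languages listed in this announcement.
LANGUAGES = [
    "Bulgarian", "Czech", "Danish", "Dutch", "English", "Estonian",
    "German", "Greek", "Finnish", "French", "Hungarian", "Italian",
    "Latvian", "Lithuanian", "Maltese", "Polish", "Portuguese",
    "Romanian", "Slovak", "Slovene", "Spanish", "Swedish",
]


def language_pairs(languages):
    """Return every unordered pair of distinct languages."""
    return list(combinations(sorted(languages), 2))


pairs = language_pairs(LANGUAGES)
print(len(pairs))   # 231 pairs for 22 languages
print(comb(21, 2))  # 210 pairs for 21 languages
```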

Manual Subject Domain Classification:

- Manually classified according to EUROVOC subject domains
- Classes selected from 6000 hierarchically organised, wide-coverage classes.

Use / Download:

- Download from http://langtech.jrc.it/JRC-Acquis.html 
- Usage free for research purposes.

For More Details:

Steinberger Ralf, Bruno Pouliquen, Anna Widiger, Camelia Ignat, Tomaž
Erjavec, Dan Tufiş, Dániel Varga (2006). 'The JRC-Acquis: A multilingual
aligned parallel corpus with 20+ languages'. Proceedings of the 5th
International Conference on Language Resources and Evaluation (LREC'2006).
Genoa, Italy, 24-26 May 2006. Available at
http://langtech.jrc.it/#Publications. 

The JRC's Language Technology group specialises in the development of
highly multilingual text analysis tools and in cross-lingual applications.
An example is our multilingual (19 languages) news analysis application
NewsExplorer, publicly accessible at http://press.jrc.it/NewsExplorer. 
	
Related JRC developments (both covering 22+ languages):

- NewsBrief (http://press.jrc.it): breaking news detection and display of
the very latest thematic news from around the world;

- Medical Information System MedISys (http://medusa.jrc.it): displays the
latest health-related news from around the world according to themes and
diseases.

Ralf Steinberger
European Commission - Joint Research Centre (JRC)
IPSC - SeS - EMM - Language Technology 
http://langtech.jrc.it, http://press.jrc.it/NewsExplorer 



Linguistic Field(s): Computational Linguistics
                     Text/Corpus Linguistics
                     Translation







	


