[Corpora-List] BILINGUAL PARALLEL CORPORA

Ralf Steinberger ralf.steinberger at jrc.it
Tue Nov 14 07:35:15 UTC 2006


Dear J.L., :-)

 

The JRC-Acquis multilingual parallel corpus is freely available for research
purposes. You can find information on the corpus and a link to the download
site at the web page:

 

    http://langtech.jrc.it/JRC-Acquis.html

 

The JRC-Acquis covers the 20 official EU languages plus Romanian. Norwegian
is thus not included, but several other Scandinavian languages are. The
corpus is paragraph-aligned for each of the 210 language pairs. Many of the
paragraphs are single sentences. 

 

I hope this helps. Greetings from the Lago Maggiore in Italy to "some place
of Spain",

 

Ralf

 

PS: JRC's multilingual news aggregation and analysis system NewsExplorer now
tracks longer news stories over time. Check it out at
http://press.jrc.it/NewsExplorer/. 

 

 

Ralf Steinberger (Ralf.Steinberger at jrc.it, http://langtech.jrc.it/RS.html)
European Commission - Joint Research Centre (JRC)
IPSC - SeS - Language Technology (http://langtech.jrc.it,
http://press.jrc.it/NewsExplorer)
21020 Ispra (VA), Italy



 

 

Here is some more information:

 

SIZE AND FORMAT

 

- 21 languages (all 20 official EU languages plus Romanian)

- Average corpus size: 8.8 million words per language

- XML Format according to TEI P4, UTF-8-encoded

- Modular: download the languages you need.

 

LANGUAGES

 

Czech, Danish, Dutch, English, Estonian, German, Greek, Finnish, French,

Hungarian, Italian, Latvian, Lithuanian, Maltese, Polish, Portuguese,

Romanian, Slovak, Slovene, Spanish, Swedish.

 

TEXT TYPES

 

- Documents on contents, principles and political objectives of the EU
Treaties

- EU legislation

- Declarations

- Resolutions

- Acts

- International agreements.

 

PARAGRAPH ALIGNMENT

 

- Paragraph-aligned for all 210 language pairs

- Paragraphs are sentence parts, sentences, or groups of sentences

- 2 alternative alignments: using Vanilla and HunAlign

- Ca. 270,000 alignments per language pair.
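
The figure of 210 language pairs follows from choosing 2 out of 21 languages.
As a quick check, here is a minimal Python sketch that enumerates the
unordered pairs from the language list above. The names LANGUAGES and
language_pairs are only illustrative and are not part of the corpus
distribution; the total at the end is a rough estimate based on the
approximate per-pair figure given above.

```python
from itertools import combinations

# The 21 corpus languages as listed in this message.
LANGUAGES = [
    "Czech", "Danish", "Dutch", "English", "Estonian", "German", "Greek",
    "Finnish", "French", "Hungarian", "Italian", "Latvian", "Lithuanian",
    "Maltese", "Polish", "Portuguese", "Romanian", "Slovak", "Slovene",
    "Spanish", "Swedish",
]

def language_pairs(languages):
    """Return all unordered pairs of distinct languages (hypothetical helper)."""
    return list(combinations(sorted(languages), 2))

pairs = language_pairs(LANGUAGES)

# Rough estimate only: ca. 270,000 alignments per pair, as stated above.
approx_total_alignments = len(pairs) * 270_000

print(len(LANGUAGES), "languages,", len(pairs), "language pairs")
print("approx. alignments over all pairs:", approx_total_alignments)
```

Each pair is counted once, so Danish-Swedish and Swedish-Danish are the same
pair.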

 

MANUAL SUBJECT DOMAIN CLASSIFICATION

 

- Manually classified according to EUROVOC subject domains

- Selected from 6000 hierarchically organised classes, wide-coverage.

 

USE / DOWNLOAD

 

- Download from http://langtech.jrc.it/JRC-Acquis.html

- Usage free for research purposes.

 

FOR MORE DETAILS

 

Ralf Steinberger, Bruno Pouliquen, Anna Widiger, Camelia Ignat, Tomaž
Erjavec, Dan Tufiş, Dániel Varga (2006). 'The JRC-Acquis: A multilingual
aligned parallel corpus with 20+ languages'. Proceedings of the 5th
International Conference on Language Resources and Evaluation (LREC'2006).
Genoa, Italy, 24-26 May 2006. Available at
http://langtech.jrc.it/#Publications.


 

 

  _____  

From: owner-corpora at lists.uib.no [mailto:owner-corpora at lists.uib.no] On
Behalf Of JLDLME
Sent: 12 November 2006 18:40
To: CORPORA at HD.UIB.NO
Subject: [Corpora-List] BILINGUAL PARALLEL CORPORA

 

Dear Corpora-List members,

 

I have three questions...

 

Does anyone know of a freely available, sentence-aligned bilingual corpus
involving several languages, in particular Scandinavian (Finnish,
Norwegian, etc.) or Latin languages (Spanish, Italian, etc.), for bilingual
studies?

 

My second question is: What would be the requirements for creating an
online/desktop software tool covering the whole process of building a
parallel corpus?

 

Finally, would you consider one million words (in both languages) a
large or a small bilingual corpus?

 

Any help will be appreciated.

 

 

Regards,

 

 

J. L. DeLucca (in some place of Spain)

 


