18.1699, FYI: Freely Available JRC-Acquis Parallel Corpus
LINGUIST Network
linguist at LINGUISTLIST.ORG
Mon Jun 4 18:57:27 UTC 2007
LINGUIST List: Vol-18-1699. Mon Jun 04 2007. ISSN: 1068 - 4875.
Subject: 18.1699, FYI: Freely Available JRC-Acquis Parallel Corpus
Moderators: Anthony Aristar, Eastern Michigan U <aristar at linguistlist.org>
Helen Aristar-Dry, Eastern Michigan U <hdry at linguistlist.org>
Reviews: Laura Welcher, Rosetta Project
<reviews at linguistlist.org>
Homepage: http://linguistlist.org/
The LINGUIST List is funded by Eastern Michigan University,
and donations from subscribers and publishers.
Editor for this issue: Dan Parker <dan at linguistlist.org>
================================================================
To post to LINGUIST, use our convenient web form at
http://linguistlist.org/LL/posttolinguist.html.
===========================Directory==============================
1)
Date: 01-Jun-2007
From: Ralf Steinberger < Ralf.Steinberger at jrc.it >
Subject: Freely Available JRC-Acquis Parallel Corpus
-------------------------Message 1 ----------------------------------
Date: Mon, 04 Jun 2007 14:54:33
From: Ralf Steinberger < Ralf.Steinberger at jrc.it >
Subject: Freely Available JRC-Acquis Parallel Corpus
We are pleased to announce a new release (version 3.0) of the freely
available multilingual parallel corpus JRC-Acquis. The corpus has nearly
tripled in size (totaling over 1 billion words), and Bulgarian texts have
now been added (thanks to the Romanian Academy of Sciences), so that the
parallel texts are now available in 22 languages.
Size and Format:
- 22 languages (all official EU languages except Irish)
- Average corpus size per language: 28.9 million words, plus 19 million
words in annexes, etc.
- 23,000 texts per language (fewer in Bulgarian, Maltese and Romanian)
- XML Format according to TEI P4, UTF-8-encoded
- Modular: download the languages you need.
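As a rough illustration of working with UTF-8-encoded TEI P4 XML, the
following Python sketch reads numbered paragraphs from a small inline
sample. The sample document, its element layout (a TEI.2 root with p
elements carrying an n attribute) and the helper name read_paragraphs are
assumptions made for this example, not a description of the actual corpus
files; check a downloaded file before relying on this structure.

```python
import xml.etree.ElementTree as ET

# Hypothetical TEI P4-style sample; the real corpus files may be laid out
# differently (this structure is an assumption for illustration only).
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<TEI.2 id="sample-en" lang="en">
  <teiHeader>
    <fileDesc><titleStmt><title>Sample</title></titleStmt></fileDesc>
  </teiHeader>
  <text>
    <body>
      <p n="1">First paragraph.</p>
      <p n="2">Second paragraph.</p>
    </body>
  </text>
</TEI.2>
"""


def read_paragraphs(xml_bytes):
    """Return (n, text) tuples for every <p> element in the document."""
    root = ET.fromstring(xml_bytes)
    # TEI P4 uses no XML namespace, so plain tag names are enough here.
    return [(p.get("n"), "".join(p.itertext()).strip()) for p in root.iter("p")]


# Parse the sample as UTF-8 bytes, as one would with a file read in binary mode.
paragraphs = read_paragraphs(SAMPLE.encode("utf-8"))
for number, text in paragraphs:
    print(number, text)
```

For a file on disk, the same helper could be fed the result of reading the
file in binary mode, so that the parser honours the encoding declaration.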
Languages:
Bulgarian, Czech, Danish, Dutch, English, Estonian, German, Greek, Finnish,
French, Hungarian, Italian, Latvian, Lithuanian, Maltese, Polish,
Portuguese, Romanian, Slovak, Slovene, Spanish, Swedish.
Text Types:
- Documents on contents, principles and political objectives of the EU Treaties
- EU legislation
- Declarations
- Resolutions
- Acts
- International agreements.
Paragraph Alignment:
Paragraph alignment for all 231 language pairs will soon be available for
version 3.0 of the corpus. The following details apply to version 2.2,
which is still available on the same website:
- Paragraph-aligned for all 210 language pairs
- Paragraphs are sentence parts, sentences, or groups of sentences
- 2 alternative alignments: using Vanilla and HunAlign
- Ca. 270,000 alignments per language pair.
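The pair counts quoted above follow from the number of unordered pairs
among n languages, n(n-1)/2: 231 pairs for 22 languages, and 210 pairs,
which corresponds to 21 languages. A minimal Python sketch of that
arithmetic, using the language names listed in this announcement (the
variable and function names are chosen for this example only):

```python
from itertools import combinations

# The 22 languages named in this announcement (version 3.0).
LANGUAGES_V3 = [
    "Bulgarian", "Czech", "Danish", "Dutch", "English", "Estonian",
    "German", "Greek", "Finnish", "French", "Hungarian", "Italian",
    "Latvian", "Lithuanian", "Maltese", "Polish", "Portuguese",
    "Romanian", "Slovak", "Slovene", "Spanish", "Swedish",
]


def pair_count(n):
    """Number of unordered language pairs among n languages."""
    return n * (n - 1) // 2


# Enumerate every unordered pair explicitly and compare with the formula.
pairs = list(combinations(LANGUAGES_V3, 2))

print(len(LANGUAGES_V3), "languages ->", len(pairs), "pairs")
print("21 languages ->", pair_count(21), "pairs")
```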
Manual Subject Domain Classification:
- Manually classified according to EUROVOC subject domains
- Selected from 6000 hierarchically organised, wide-coverage classes.
Use / Download:
- Download from http://langtech.jrc.it/JRC-Acquis.html
- Usage free for research purposes.
For More Details:
Steinberger, Ralf, Bruno Pouliquen, Anna Widiger, Camelia Ignat, Tomaž
Erjavec, Dan Tufiş, Dániel Varga (2006). 'The JRC-Acquis: A multilingual
aligned parallel corpus with 20+ languages'. Proceedings of the 5th
International Conference on Language Resources and Evaluation (LREC'2006).
Genoa, Italy, 24-26 May 2006. Available at
http://langtech.jrc.it/#Publications.
The JRC's Language Technology group specialises in the development of
highly multilingual text analysis tools and in cross-lingual applications.
An example is our multilingual (19 languages) news analysis application
NewsExplorer, publicly accessible at http://press.jrc.it/NewsExplorer.
Related JRC developments (both covering 22+ languages):
- NewsBrief (http://press.jrc.it): breaking news detection and display of
the very latest thematic news from around the world;
- Medical Information System MedISys (http://medusa.jrc.it): displays the
latest health-related news from around the world according to themes and
diseases.
Ralf Steinberger
European Commission - Joint Research Centre (JRC)
IPSC - SeS - EMM - Language Technology
http://langtech.jrc.it, http://press.jrc.it/NewsExplorer
Linguistic Field(s): Computational Linguistics
Text/Corpus Linguistics
Translation
-----------------------------------------------------------