Public-domain parallel corpus of the 22 official languages of the European Union (including Spanish)
Carlos Subirats
carlos.subirats at GMAIL.COM
Sun Jun 10 17:56:39 UTC 2007
------------------- INFOLING --------------------
Mailing list on Spanish linguistics (ISSN: 1576-3404): http://elies.rediris.es/infoling/
Submissions: infoling-request at listserv.rediris.es
EDITORS:
Carlos Subirats Rüggeberg, UAB <carlos.subirats at uab.es>
Mar Cruz Piñol, U. Barcelona <mcruz at ub.edu>
Eulalia de Bobes Soler, U. Abat Oliba-CEU <debobes1 at uao.es>
Editorial team: http://elies.rediris.es/infoling/editores.html
Estudios de Lingüística del Español (ELiEs): http://elies.rediris.es
is a thematic network on Spanish linguistics associated with INFOLING.
---------------------------------------------------------------------
INFOLING: an independent, global list
© Infoling Barcelona (Spain), 2006. All rights reserved
--------------------------------------------------------------------------------------
Freely Available JRC-Acquis Parallel Corpus
Public-domain parallel corpus of the 22 official languages of the
European Union (including Spanish), with the option of downloading
only the language or languages needed
Download: http://langtech.jrc.it/JRC-Acquis.html
Information from Ralf Steinberger, distributed by Linguist List:
http://linguistlist.org/issues/18/18-1699.html#1
--------------------------------------------------------------------------------------
New release of the freely available multilingual parallel corpus
JRC-Acquis (version 3.0).
The corpus size has nearly tripled, to over 1 billion words
(1,000,000,000), and Bulgarian texts have now been added (thanks to
the Romanian Academy of Sciences), so the parallel texts are now
available in 22 languages.
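As a rough consistency check on the total, the per-language averages listed under Size and Format (28.9 million words plus 19 million words in annexes) can be multiplied across the 22 languages. This is only a sketch: it assumes the averages apply uniformly to every language, and the constant and function names are made up for illustration.

```python
# Back-of-the-envelope check of the "over 1 billion words" figure.
# Assumption: the published per-language averages hold for all 22 languages.
LANGUAGES = 22
BODY_WORDS_PER_LANGUAGE = 28_900_000   # "28.9 million words"
ANNEX_WORDS_PER_LANGUAGE = 19_000_000  # "19 million words in annexes, etc."


def estimated_total_words(languages=LANGUAGES):
    """Return the estimated corpus size in words across all languages."""
    return languages * (BODY_WORDS_PER_LANGUAGE + ANNEX_WORDS_PER_LANGUAGE)


if __name__ == "__main__":
    print(f"{estimated_total_words():,} words")
```

The estimate comes to roughly 1.05 billion words, which is consistent with the stated total.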
Size and Format:
- 22 languages (all official EU languages except Irish)
- Average corpus size per language: 28.9 million words + 19 million
words in annexes, etc.
- 23,000 texts per language (fewer in Bulgarian, Maltese and Romanian)
- XML Format according to TEI P4, UTF-8-encoded
- Modular: download the languages you need.
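Because the texts are UTF-8-encoded XML, they can be read with a standard XML parser. The sketch below is a minimal illustration, not the documented layout of the corpus files: TEI.2, text, body and p are standard TEI P4 element names, but the sample document, its attributes, and the function name are assumptions and should be checked against a downloaded file.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; the real JRC-Acquis files may be
# structured differently (extra divisions, headers, other attributes).
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<TEI.2 id="sample-doc" lang="es">
  <text>
    <body>
      <p n="1">Primer párrafo de ejemplo.</p>
      <p n="2">Segundo párrafo de ejemplo.</p>
    </body>
  </text>
</TEI.2>
""".encode("utf-8")


def paragraphs(xml_bytes):
    """Return (paragraph number, text) pairs for every <p> element."""
    root = ET.fromstring(xml_bytes)
    return [(p.get("n"), "".join(p.itertext()).strip()) for p in root.iter("p")]


if __name__ == "__main__":
    for number, text in paragraphs(SAMPLE):
        print(number, text)
```

Passing bytes rather than a decoded string lets the parser honour the encoding named in the XML declaration.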
Languages:
Bulgarian, Czech, Danish, Dutch, English, Estonian, German, Greek,
Finnish, French, Hungarian, Italian, Latvian, Lithuanian, Maltese,
Polish, Portuguese, Romanian, Slovak, Slovene, SPANISH, Swedish.
Text Types:
- Documents on contents, principles and political objectives of the EU Treaties
- EU legislation
- Declarations
- Resolutions
- Acts
- International agreements.
Paragraph Alignment:
Paragraph alignment for all 231 language pairs will soon be available
for version 3.0 of the corpus. The following text applies to version
2.2, still available on the same website:
- Paragraph-aligned for all 210 language pairs
- Paragraphs are sentence parts, sentences, or groups of sentences
- 2 alternative alignments: using Vanilla and HunAlign
- Ca. 270,000 alignments per language pair.
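The pair counts quoted above follow from choosing 2 languages out of n, that is n(n-1)/2: 22 languages give 231 pairs, and 210 pairs imply 21 languages for version 2.2. A minimal sketch, using the language names from the announcement; the variable and function names are illustrative only.

```python
from itertools import combinations
from math import comb

# The 22 languages listed in the announcement for version 3.0.
V3_LANGUAGES = [
    "Bulgarian", "Czech", "Danish", "Dutch", "English", "Estonian",
    "German", "Greek", "Finnish", "French", "Hungarian", "Italian",
    "Latvian", "Lithuanian", "Maltese", "Polish", "Portuguese",
    "Romanian", "Slovak", "Slovene", "Spanish", "Swedish",
]


def language_pairs(languages):
    """Return every unordered pair of distinct languages."""
    return list(combinations(sorted(languages), 2))


if __name__ == "__main__":
    print(len(language_pairs(V3_LANGUAGES)))  # pairs for 22 languages
    print(comb(21, 2))                        # pairs for 21 languages
```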
Manual Subject Domain Classification:
- Manually classified according to EUROVOC subject domains
- Selected from 6000 hierarchically organised classes, wide-coverage.
Use / Download:
- Download from http://langtech.jrc.it/JRC-Acquis.html
- Usage free for research purposes.
For More Details:
Steinberger Ralf, Bruno Pouliquen, Anna Widiger, Camelia Ignat, Tomaž
Erjavec, Dan Tufiş, Dániel Varga (2006). 'The JRC-Acquis: A
multilingual aligned parallel corpus with 20+ languages'. Proceedings
of the 5th International Conference on Language Resources and
Evaluation (LREC'2006). Genoa, Italy, 24-26 May 2006. Available at:
http://langtech.jrc.it/#Publications .
The JRC's Language Technology group specialises in the development of
highly multilingual text analysis tools and in cross-lingual
applications. An example is our multilingual (19 languages) news
analysis application NewsExplorer, publicly accessible at
http://press.jrc.it/NewsExplorer .
Related JRC developments (both covering 22+ languages):
- NewsBrief (http://press.jrc.it): breaking news detection and
display of the very latest thematic news from around the world;
- Medical Information System MedISys (http://medusa.jrc.it): displays
the latest health-related news from around the world according to
themes and diseases.
Ralf Steinberger
European Commission - Joint Research Centre (JRC)
IPSC - SeS - EMM - Language Technology
----------------------------------------------------------------------
The September 2006 issue of "Unidad en la diversidad. Portal informativo sobre la lengua castellana" publishes the history of Infoling during its first ten years:
http://www.unidadenladiversidad.com/opinion/opinion_ant/2006/sep_oct_06/sep_oct_06.htm
Website of "Unidad en la diversidad": http://www.unidadenladiversidad.com
Contents of the latest volume of "Unidad en la diversidad":
http://www.unidadenladiversidad.com/sumario.htm
----------------------------------------------------------------------