[Corpora-List] free corpus

Ralf Steinberger ralf.steinberger at jrc.it
Fri Nov 23 08:19:26 UTC 2007


Hello Peter,

You did not say which languages you are looking for. The JRC-Acquis is a
parallel corpus in 22 EU languages. It is freely available for research
purposes. You can find it at http://langtech.jrc.it/JRC-Acquis.html. You can
download a single language or all of them, with or without alignment
information.

It is not enormous for single languages (about 1 billion words altogether
for the 22 languages), but to our knowledge it is the biggest parallel
corpus, considering the number of languages it covers. I hope this is useful
for your purposes.

All the best,

Ralf

Ralf Steinberger (Ralf.Steinberger at jrc.it)

European Commission - Joint Research Centre (JRC)
IPSC - SeS - Language Technology (http://langtech.jrc.it)

JRC-Acquis Multilingual Parallel Corpus (Version 3)

*       Freely available for research purposes.

*       22 languages: Bulgarian, Czech, Danish, German, Greek, English,
Spanish, Estonian, Finnish, French, Hungarian, Italian, Lithuanian, Latvian,
Maltese, Dutch, Polish, Portuguese, Romanian, Slovak, Slovene and Swedish.

*       Altogether over 1 billion words.

*       Sentence alignment for 231 language pairs.

*       For more information and download, see
http://langtech.jrc.it/JRC-Acquis.html.


The JRC's Language Technology group specialises in the development of highly
multilingual text analysis tools and in cross-lingual applications. Many
applications are accessible online, e.g.:

*       NewsExplorer (http://press.jrc.it/NewsExplorer/): multilingual news
aggregation and analysis (19 languages); lets users navigate the news over
time and across languages; trend analysis; collects information about people
from the news; social network detection.

*       NewsBrief (http://press.jrc.it/): breaking news detection and
display of the very latest thematic news from around the world; email
alerting (22+ languages).

*       MedISys Medical Information System (http://medusa.jrc.it/): latest
health-related news from around the world according to themes and diseases
(22+ languages).

  _____  

From: corpora-bounces at uib.no [mailto:corpora-bounces at uib.no] On Behalf Of
Peter Isaev
Sent: 19 November 2007 12:38
To: CORPORA at uib.no
Subject: [Corpora-List] free corpus

Hello.

I'm looking for a big free corpus consisting of plain text, something like
the BNC (which is not free).

Where can I download it?

Thank you.
