[Corpora-List] free corpus
Ralf Steinberger
ralf.steinberger at jrc.it
Fri Nov 23 08:19:26 UTC 2007
Hello Peter,
You did not say which languages you were looking for. The JRC-Acquis is a
parallel corpus in 22 EU languages. It is freely available for research
purposes. You can find it at http://langtech.jrc.it/JRC-Acquis.html. You can
download a single language or all of them, with or without alignment
information.
It is not enormous for single languages (about 1 billion words altogether
for the 22 languages), but to our knowledge it is the biggest parallel
corpus, considering the number of languages it covers. I hope this is useful
for your purposes.
All the best,
Ralf
Ralf Steinberger (Ralf.Steinberger at jrc.it)
European Commission - Joint Research Centre (JRC)
IPSC - SeS - Language Technology (http://langtech.jrc.it)
JRC-Acquis Multilingual Parallel Corpus (Version 3)
* Freely available for research purposes.
* 22 languages: Bulgarian, Czech, Danish, German, Greek, English,
Spanish, Estonian, Finnish, French, Hungarian, Italian, Lithuanian, Latvian,
Maltese, Dutch, Polish, Portuguese, Romanian, Slovak, Slovene and Swedish.
* Altogether over 1 billion words.
* Sentence alignment for 231 language pairs.
* For more information and download, see
http://langtech.jrc.it/JRC-Acquis.html.
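The figure of 231 language pairs follows from the language count: 22 languages give 22 x 21 / 2 = 231 unordered pairs. A minimal Python sketch of that check is below; the two-letter codes are labels assumed here for illustration and are not necessarily the identifiers used in the corpus distribution.

```python
from itertools import combinations

# Illustrative two-letter codes for the 22 languages listed above
# (an assumption; the corpus's own file naming may differ).
LANGUAGES = [
    "bg", "cs", "da", "de", "el", "en", "es", "et", "fi", "fr", "hu",
    "it", "lt", "lv", "mt", "nl", "pl", "pt", "ro", "sk", "sl", "sv",
]


def language_pairs(languages):
    """Return every unordered pair of distinct languages."""
    return list(combinations(languages, 2))


pairs = language_pairs(LANGUAGES)
print(len(LANGUAGES), "languages ->", len(pairs), "pairs")
```

Each pair appears once, so an alignment for German-English also serves as the alignment for English-German.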
The JRC's Language Technology group specialises in the development of highly
multilingual text analysis tools and in cross-lingual applications. Many
applications are accessible online, e.g.:
* NewsExplorer (http://press.jrc.it/NewsExplorer/): multilingual news
aggregation and analysis (19 languages); lets users navigate the news over
time and across languages; trend analysis; collects information about people
from the news; social network detection.
* NewsBrief (http://press.jrc.it/): breaking news detection and
display of the very latest thematic news from around the world; email
alerting (22+ languages).
* MedISys Medical Information System (http://medusa.jrc.it/): latest
health-related news from around the world according to themes and diseases
(22+ languages).
_____
From: corpora-bounces at uib.no [mailto:corpora-bounces at uib.no] On Behalf Of
Peter Isaev
Sent: 19 November 2007 12:38
To: CORPORA at uib.no
Subject: [Corpora-List] free corpus
Hello.
I'm looking for a big free corpus consisting of plain text, something like
the BNC corpus (which is not free).
Where can I download it?
Thank you.