[Corpora-List] Romanian language corpora
Ralf Steinberger
ralf.steinberger at jrc.it
Wed Dec 5 08:24:38 UTC 2007
Dear Mihai,
Both the JRC-Acquis parallel corpus and the DGT-Translation Memory contain
Romanian texts. However, they are not speech transcriptions.
You can find more information on our website and also download them from
there:
http://langtech.jrc.it/
The JRC-Acquis has both full texts in 22 languages (including Romanian) and
~sentence alignments for all 21 language pairs involving Romanian. The
Romanian part of the JRC-Acquis consists of about 20 million words.
DGT-TM is a Translation Memory involving the same 22 languages, i.e. it is a
loose collection of translation units (mostly sentences). From these, the
full text cannot be reconstructed, but the added value compared to
JRC-Acquis is that the cross-lingual sentence alignments have been verified
manually. The size of the Romanian part is 650,000 translation units
(~sentences).
I hope this helps.
All the best,
Ralf
Ralf Steinberger (Ralf.Steinberger at jrc.it)
European Commission - Joint Research Centre (JRC)
IPSC - SeS - Language Technology (http://langtech.jrc.it/)
JRC-Acquis Multilingual Parallel Corpus (Version 3)
* Freely available for research purposes.
* 22 languages: Bulgarian, Czech, Danish, German, Greek, English,
Spanish, Estonian, Finnish, French, Hungarian, Italian, Lithuanian, Latvian,
Maltese, Dutch, Polish, Portuguese, Romanian, Slovak, Slovene and Swedish.
* Altogether over 1 billion words.
* Sentence alignment for 231 language pairs.
* For more information and download, see
http://langtech.jrc.it/JRC-Acquis.html
The JRC's Language Technology group specialises in the development of highly
multilingual text analysis tools and in cross-lingual applications. Many
applications are accessible online, e.g.:
* NewsExplorer (http://press.jrc.it/NewsExplorer/): multilingual news
aggregation and analysis (19 languages); lets users navigate the news over
time and across languages; trend analysis; collects information about people
from the news; social network detection.
* NewsBrief (http://press.jrc.it/): breaking news detection and
display of the very latest thematic news from around the world; email
alerting (22+ languages).
* MedISys Medical Information System (http://medusa.jrc.it/): latest
health-related news from around the world according to themes and diseases
(22+ languages).
_____
From: corpora-bounces at uib.no [mailto:corpora-bounces at uib.no] On Behalf Of
Mihai Daniel Frumuselu
Sent: 03 December 2007 18:31
To: Corpora at uib.no
Subject: [Corpora-List] Romanian language corpora
Dear Madam/Sir,
I am currently looking for Romanian language corpora in electronic format,
particularly conversation transcriptions. A colleague from Linguist List
advised me to contact you. Do you happen to know whether there are corpora of
Romanian, either online or on disk?
Thank you and best regards,
Mihai Frumuselu
Mihai Daniel Frumuselu
PhD student in linguistics
Björnkullaringen 28D
141 51 Huddinge
Tel.: (08) 42 86 52 31 (home)
0704 - 29 85 51 (mobile)
www.mihai.se, www.oru.se/hum/mihai_frumuselu
www.mihaidaniel.myphotoalbum.com
e-mail: mihai.frumuselu at gmail.com
_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora