[Corpora-List] Romanian language corpora

Ralf Steinberger ralf.steinberger at jrc.it
Wed Dec 5 08:24:38 UTC 2007


Dear Mihai,

Both the JRC-Acquis parallel corpus and the DGT-Translation Memory contain
Romanian texts. However, they are not speech transcriptions.

You can find more information on our website and also download them from
there:

            http://langtech.jrc.it/

The JRC-Acquis has both full texts in 22 languages (including Romanian) and
roughly sentence-level alignments for all 21 language pairs involving
Romanian. The Romanian part of the JRC-Acquis consists of about 20 million
words.
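
As a quick sanity check on the pair counts mentioned in this message (21
pairs involving Romanian, 231 pairs overall), here is a minimal Python
sketch. It assumes the 22 language names exactly as listed for JRC-Acquis
Version 3 in the signature of this message; the variable names are
illustrative only and are not part of any corpus distribution.

```python
from itertools import combinations

# The 22 JRC-Acquis languages, as named in the signature of this message.
LANGUAGES = [
    "Bulgarian", "Czech", "Danish", "German", "Greek", "English",
    "Spanish", "Estonian", "Finnish", "French", "Hungarian", "Italian",
    "Lithuanian", "Latvian", "Maltese", "Dutch", "Polish", "Portuguese",
    "Romanian", "Slovak", "Slovene", "Swedish",
]

# Every unordered pair of distinct languages: n * (n - 1) / 2 pairs.
all_pairs = list(combinations(LANGUAGES, 2))

# Pairs in which Romanian is one of the two languages: n - 1 pairs.
romanian_pairs = [pair for pair in all_pairs if "Romanian" in pair]

print(len(LANGUAGES), len(all_pairs), len(romanian_pairs))  # 22 231 21
```

With 22 languages there are 22 * 21 / 2 = 231 unordered pairs, of which
22 - 1 = 21 include Romanian, which matches the figures quoted here.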

 

DGT-TM is a Translation Memory involving the same 22 languages, i.e. it is a
loose collection of translation units (mostly sentences). From these, the
full text cannot be reconstructed, but the added value compared to
JRC-Acquis is that the cross-lingual sentence alignments have been verified
manually. The size of the Romanian part is 650,000 translation units
(~sentences).

I hope this helps.

All the best,

Ralf

Ralf Steinberger (Ralf.Steinberger at jrc.it)

European Commission - Joint Research Centre (JRC)
IPSC - SeS - Language Technology (http://langtech.jrc.it)

JRC-Acquis Multilingual Parallel Corpus (Version 3)

*       Freely available for research purposes.

*       22 languages: Bulgarian, Czech, Danish, German, Greek, English,
Spanish, Estonian, Finnish, French, Hungarian, Italian, Lithuanian, Latvian,
Maltese, Dutch, Polish, Portuguese, Romanian, Slovak, Slovene and Swedish.

*       Altogether over 1 billion words.

*       Sentence alignment for 231 language pairs.

*       For more information and download, see
http://langtech.jrc.it/JRC-Acquis.html.

 


The JRC’s Language Technology group specialises in the development of highly
multilingual text analysis tools and in cross-lingual applications. Many
applications are accessible online, e.g.:

*        <http://press.jrc.it/NewsExplorer/> NewsExplorer: multilingual news
aggregation and analysis (19 languages); lets users navigate the news over
time and across languages; trend analysis; collects information about people
from the news; social network detection.

*        <http://press.jrc.it/> NewsBrief: breaking news detection and
display of the very latest thematic news from around the world; email
alerting (22+ languages).

*        <http://medusa.jrc.it/> MedISys Medical Information System: latest
health-related news from around the world according to themes and diseases
(22+ languages).

 

 

  _____  

From: corpora-bounces at uib.no [mailto:corpora-bounces at uib.no] On Behalf Of
Mihai Daniel Frumuselu
Sent: 03 December 2007 18:31
To: Corpora at uib.no
Subject: [Corpora-List] Romanian language corpora

 

Dear Madam/Sir,

I am currently looking for Romanian language corpora in electronic format,
particularly conversation transcriptions. A colleague from Linguist List
advised me to contact you. Do you happen to know whether there are corpora of
Romanian, either online or on disk?

 

Thank you and best regards,

 

Mihai Frumuselu


Mihai Daniel Frumuselu
PhD student in linguistics

Björnkullaringen 28D
141 51 Huddinge

Tel.:  (08) 42 86 52 31 (home)
        0704 - 29 85 51 (mobile)

www.mihai.se, www.oru.se/hum/mihai_frumuselu
www.mihaidaniel.myphotoalbum.com
e-mail: mihai.frumuselu at gmail.com
