[Corpora-List] multilingual comparable corpora

Linguistic Data Consortium ldc at ldc.upenn.edu
Wed Feb 2 17:17:49 UTC 2005


I would just like to make a correction to the earlier post.  You do not
need to be a member of the LDC to license the TDT and Broadcast News data.

A few LDC corpora that fit the bill include:

LDC94T5 ECI Multilingual Text
LDC94T4A UN Parallel Text (Complete)
LDC95T20 Hansard French/English
LDC2001T57 TDT2 Multilanguage Text Version 4.0
LDC2001T58 TDT3 Multilanguage Text Version 2.0
LDC2004T08 Hong Kong Parallel Text (note: this one does require membership)
LDC2004T18 Arabic English Parallel News Part 1

Information on the above is available at:

http://www.ldc.upenn.edu/Catalog/ByYear.jsp

Best,

Ilya


pascale at cs.ust.hk wrote:

>Try TDT data and Broadcast News from the LDC. You must be an LDC member to
>license the corpora.
>
>However, be reminded that these "comparable" corpora still need to be
>topic aligned to make them really comparable as they contain both on-topic
>and off-topic documents (i.e. documents not on the same topic and
>therefore not comparable).
>
>Our paper on "Mining very non parallel corpora: Parallel sentence and
>lexicon extraction by bootstrapping and EM" (Fung & Cheung 2004) in EMNLP
>2004 describes our methodology and contains some useful references.
>
>Regards,
>Pascale
>
>
>>hi all,
>>
>>are there multilingual comparable corpora suitable for research on
>>paraphrases?
>>for instance, two collections of articles from different sources
>>describing the same events *and* in different languages.
>>
>>Any suggestions on how to build this kind of resource would be helpful
>>too.
>>
>>thank you,
>>Grazia

--


Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------
Linguistic Data Consortium                     Phone: (215) 573-1275
3600 Market Street                             Fax:   (215) 573-2175
Suite 810                             email: ldc at ldc.upenn.edu
Philadelphia, PA 19104                 www: http://www.ldc.upenn.edu
