[Corpora-List] multilingual comparable corpora
Linguistic Data Consortium
ldc at ldc.upenn.edu
Wed Feb 2 17:17:49 UTC 2005
I would just like to make a correction to the earlier post. You do not
need to be a member of the LDC to license the TDT and Broadcast News data.
A few LDC corpora that fit the bill include:
LDC94T5 ECI Multilingual Text
LDC94T4A UN Parallel Text (Complete)
LDC95T20 Hansard French/English
LDC2001T57 TDT2 Multilanguage Text Version 4.0
LDC2001T58 TDT3 Multilanguage Text Version 2.0
LDC2004T08 Hong Kong Parallel Text - note - this does require membership
LDC2004T18 Arabic English Parallel News Part 1
Information on the above is available at:
http://www.ldc.upenn.edu/Catalog/ByYear.jsp
Best,
Ilya
pascale at cs.ust.hk wrote:
>Try TDT data and Broadcast News from the LDC. You must be an LDC member to
>license the corpora.
>
>However, be reminded that these "comparable" corpora still need to be
>topic aligned to make them really comparable as they contain both on-topic
>and off-topic documents (i.e. documents not on the same topic and
>therefore not comparable).
>
>Our paper on "Mining very non parallel corpora: Parallel sentence and
>lexicon extraction by bootstrapping and EM" (Fung & Cheung 2004) in EMNLP
>2004 describes our methodology and contains some useful references.
>
>Regards,
>Pascale
>
>
>>hi all,
>>
>>are there multilingual comparable corpora suitable for research on
>>paraphrases?
>>for instance, two collections of articles from different sources
>>describing the same events *and* in different languages.
>>
>>Any suggestions on how to build this kind of resources would be helpful
>>too.
>>
>>thank you,
>>Grazia
--
Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------
Linguistic Data Consortium Phone: (215) 573-1275
3600 Market Street Fax: (215) 573-2175
Suite 810 email: ldc at ldc.upenn.edu
Philadelphia, PA 19104 www: http://www.ldc.upenn.edu