[Corpora-List] multilingual comparable corpora

pascale at cs.ust.hk pascale at cs.ust.hk
Wed Feb 2 16:27:46 UTC 2005


Try TDT data and Broadcast News from the LDC. You must be an LDC member to
license the corpora.

Note, however, that these "comparable" corpora still need to be
topic-aligned before they are truly comparable, since they contain both
on-topic and off-topic documents (i.e. documents that are not on the same
topic and are therefore not comparable).

Our paper "Mining very non parallel corpora: Parallel sentence and
lexicon extraction by bootstrapping and EM" (Fung & Cheung 2004), in EMNLP
2004, describes our methodology and contains some useful references.

Regards,
Pascale
>
>
> hi all,
>
> are there multilingual comparable corpora suitable for research on
> paraphrases?
> for instance, two collections of articles from different sources
> describing the same events *and* in different languages.
>
> Any suggestions on how to build this kind of resource would be helpful
> too.
>
> thank you,
> Grazia
>
>


