[Corpora-List] Seeking for a free comparable corpus

Francis Bond bond at ieee.org
Sun Jun 15 01:58:25 UTC 2014


G'day.

> No, articles from Wikipedia in different languages are NOT a comparable
> corpus, for many reasons
>
> First, most of the time they are a (more or less free) translation of a
> master/initial one.

Do you have a citation for this?  As far as I know it is not
generally true: pages are written pretty much entirely independently
(at least for the English and Japanese Wikipedias, which I am
familiar with).  I also clicked through a random sample of languages
for the page on tennis, and they are all structured very differently.

I seem to recall a shared task on aligning sentences in Wikipedia
articles that found them not at all similar, but I am afraid I can't
find the paper.  Does anyone else recall it?

-- 
Francis Bond <http://www3.ntu.edu.sg/home/fcbond/>
Division of Linguistics and Multilingual Studies
Nanyang Technological University
