[Corpora-List] Seeking for a free comparable corpus
Francis Bond
bond at ieee.org
Sun Jun 15 01:58:25 UTC 2014
G'day.
> No, articles from Wikipedia in different languages are NOT a comparable
> corpus, for many reasons
>
> First, most of the time they are a (more or less free) translation of a
> master/initial one.
Do you have a citation for this? As far as I know it is not
generally true: pages are written pretty much entirely independently
(at least for the English and Japanese Wikipedias which I am
experienced with). I also clicked a random sample of languages for
the page on tennis, and they are all very differently structured.
I seem to recall a shared task on aligning sentences in Wikipedia
articles that found them not at all similar, but I am afraid I can't
find the paper: does anyone else recall it?
--
Francis Bond <http://www3.ntu.edu.sg/home/fcbond/>
Division of Linguistics and Multilingual Studies
Nanyang Technological University