[Corpora-List] Seeking for a free comparable corpus

Sun Jun 15 03:41:21 UTC 2014

Perhaps the most characteristic feature of Wikipedia is its long tail, and
the apparently different features (and editorial behaviour?) of the tail
and head. What is true of the most important/popular articles may rarely be
true of the majority (it's unclear which we care about in this case). For
example, our work in entity type classification
<http://downloads.schwa.org/pubs/pdf/aij10wikiner.pdf> has compared
training and testing on a random or a "popular" sample, each of about 2000
articles altogether. A model trained on the random sample achieves 92% F1
on popular articles, but the reverse (trained on popular, tested on random)
yields only 75%, although training and testing on random samples reaches
90% F1. This is mostly indicative of type distributions, but no doubt
editing patterns show similar discrepancies.

Therefore I might guess that the more universally popular articles like
[[Tennis]] are going to appear different, while the plethora of more minor
entries (e.g. bands, corporations) are likely to have clearer parallels.

Additionally, there will be divergence after translation (notably
restructuring in the most popular articles of actively edited Wikipedias)
which makes cognates (may I?) hard to identify from the current pages. Thus
"clicking a random sample of languages for the page on tennis" may be made
more precise if one compares a foundational edit, or perhaps the historical
edit that introduced the largest portion of text to a page, to the state of
the English Wikipedia equivalent *at that time*. However, the example of
Japanese tennis
<http://ja.wikipedia.org/w/index.php?title=%E3%83%86%E3%83%8B%E3%82%B9&oldid=356563>
in 2004 compared to English
<http://en.wikipedia.org/w/index.php?title=Tennis&oldid=2702021> is not
very suggestive.
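
To make that comparison concrete, here is a rough sketch of how one might
pick the revisions to compare. It assumes each page's history is already
available as (revision id, timestamp, size in bytes) records, for instance
from the MediaWiki API's revisions query. The Revision type, both function
names and the toy histories are invented for illustration, and the growth
in byte size is only a crude proxy for "the largest portion of text".

```python
from __future__ import annotations

from bisect import bisect_right
from typing import NamedTuple, Optional


class Revision(NamedTuple):
    """Hypothetical record for one edit in a page history."""
    rev_id: int
    timestamp: str  # ISO 8601 UTC; strings in this form sort chronologically
    size: int       # page size in bytes after the edit


def largest_addition(history: list[Revision]) -> Revision:
    """Return the revision that grew the page most over its predecessor.

    The first revision counts as adding its whole size.
    """
    ordered = sorted(history, key=lambda r: r.timestamp)
    best, best_delta = ordered[0], ordered[0].size
    prev_size = ordered[0].size
    for rev in ordered[1:]:
        delta = rev.size - prev_size
        if delta > best_delta:
            best, best_delta = rev, delta
        prev_size = rev.size
    return best


def state_at(history: list[Revision], timestamp: str) -> Optional[Revision]:
    """Return the latest revision at or before `timestamp`, or None."""
    ordered = sorted(history, key=lambda r: r.timestamp)
    i = bisect_right([r.timestamp for r in ordered], timestamp)
    return ordered[i - 1] if i else None


# Toy histories with made-up ids, dates and sizes (not real Wikipedia data).
ja_history = [
    Revision(1, "2003-05-01T00:00:00Z", 500),
    Revision(2, "2004-02-10T00:00:00Z", 4200),
    Revision(3, "2004-06-01T00:00:00Z", 4500),
]
en_history = [
    Revision(10, "2003-01-01T00:00:00Z", 8000),
    Revision(11, "2004-01-15T00:00:00Z", 12000),
    Revision(12, "2004-03-01T00:00:00Z", 12500),
]

ja_rev = largest_addition(ja_history)
en_rev = state_at(en_history, ja_rev.timestamp)
print(ja_rev.rev_id, en_rev.rev_id if en_rev else None)
```

The two revision ids returned could then be plugged into oldid= URLs like
the ones above and the texts compared by hand or with an aligner.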

And I recall Elena Filatova
<http://storm.cis.fordham.edu/~filatova/publications.html> did some
pioneering work in computationally exploiting parallels and differences in
multilingual Wikipedia.

Cheers,

Joel Nothman
School of IT
University of Sydney

On 15 June 2014 11:58, Francis Bond <bond at ieee.org> wrote:

> G'day.
>
> > No, articles from Wikipedia in different languages are NOT a comparable
> > corpus, for many reasons
> >
> > First, most of the time they are a (more or less free) translation of a
> > master/initial one.
>
> Do you have a citation for this?   As far as I know it is not
> generally true, pages are written pretty much entirely independently
> (at least for the English and Japanese Wikipedias which I am
> experienced with).  I also clicked a random sample of languages for
> the page on tennis, and they are all very differently structured.
>
> I seem to recall a shared task on aligning sentences in wikipedia
> articles that found them not at all similar, but I am afraid I can't
> find the paper: does anyone else recall it?
>
> --
> Francis Bond <http://www3.ntu.edu.sg/home/fcbond/>
> Division of Linguistics and Multilingual Studies
> Nanyang Technological University
>
> _______________________________________________
> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
>