[Corpora-List] Seeking for a free comparable corpus

hosein azarbonyad hosein.azarbonyad at yahoo.com
Mon Jun 16 07:28:51 UTC 2014


Dear Kristian,

Building a comparable corpus poses many challenges, such as covering all domains, filtering out non-useful documents, and ensuring high vocabulary coverage in both languages. Since building a comparable corpus is not my main research, I have neither the knowledge nor the time to construct a high-quality one. I just want to use a comparable corpus in my research and, as I said, dealing with the challenges of constructing one is not something I can investigate.

 
Best Regards,
Hosein Azarbonyad


On Monday, June 16, 2014 10:33 AM, Kristian Kankainen <kristian at eki.ee> wrote:
 


Hosein,

It's not entirely clear to me what kind of corpus structure you are
looking for. You can read about downloading portions (or the entirety)
of the different language Wikipedias here: http://en.wikipedia.org/wiki/Wikipedia:Database_download

Since the interwiki language links are also available separately, I am
sure it is possible, and quite straightforward, to compose a corpus
with a structure of your own liking.
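
As a rough illustration of that composition step, here is a minimal Python sketch. It assumes the language links have already been loaded into memory as (page id, language code, target title) rows, which is the shape of MediaWiki's langlinks table (ll_from, ll_lang, ll_title). The function name align_articles and the sample rows are hypothetical, and fetching the article texts themselves is left out.

```python
from typing import Dict, Iterable, List, Tuple

# Assumed row shape, mirroring MediaWiki's langlinks table:
# (ll_from = source page id, ll_lang = target language code, ll_title = target title)
LangLink = Tuple[int, str, str]


def align_articles(
    source_titles: Dict[int, str],
    langlinks: Iterable[LangLink],
    target_lang: str,
) -> List[Tuple[str, str]]:
    """Pair each known source title with its linked title in target_lang."""
    pairs = []
    for page_id, lang, title in langlinks:
        # Skip other languages, empty targets, and pages missing from the source set.
        if lang != target_lang or not title or page_id not in source_titles:
            continue
        pairs.append((source_titles[page_id], title))
    return sorted(pairs)


if __name__ == "__main__":
    # Illustrative rows only, not taken from a real dump.
    titles = {1: "Tennis", 2: "Sydney", 3: "Page without links"}
    links = [
        (1, "ja", "テニス"),
        (1, "et", "Tennis"),
        (2, "ja", "シドニー"),
        (4, "ja", "Unknown source page"),
    ]
    for en_title, ja_title in align_articles(titles, links, "ja"):
        print(en_title, "<->", ja_title)
```

Each resulting pair names two topically related articles; whether they are similar enough to count as comparable for a given task would still need checking.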

A small note, though. Since all Wikipedia texts are under copyleft
licenses, any derived corpus must carry the same kind of copyleft
license. I personally find this the right way to advance science.

Best wishes
Kristian K


On 15.06.2014 09:07, hosein azarbonyad wrote:

>As I recall, many papers have used Wikipedia articles as comparable corpora in CLIR, because CLIR does not need documents that are exact translations of each other. For our task, a collection of topically related, aligned documents is enough. However, I couldn't find any free comparable corpus extracted from Wikipedia. Is there one? I know there are some comparable corpora in the CLEF datasets, but they aren't free.
>
> 
>Best Regards,
>Hosein Azarbonyad
>
>
>
>On Sunday, June 15, 2014 8:42 AM, Joel Nothman <joel at it.usyd.edu.au> wrote:
> 
>
>
>Perhaps the most characteristic feature of Wikipedia is its long tail, and the apparently different features (and editorial behaviour?) of the tail and head. What is true of the most important/popular articles may rarely be true of the majority (it's unclear which we care about in this case). For example, our work in entity type classification has compared training and testing on a random or a "popular" sample, each of about 2000 articles altogether. A model trained on the random sample achieves 92% F1 on popular articles, but the reverse yields only 75%, although a model trained on random articles reaches 90% F1 on random articles. This is mostly indicative of type distributions, but no doubt editing patterns face similar discrepancies.
>
>
>Therefore I might guess that the more universally popular articles like [[Tennis]] are going to appear different, while the plethora of more minor entries (e.g. bands, corporations) are likely to have clearer parallels. 
>
>
>Additionally, there will be divergence after translation (notably restructuring in the most popular articles of actively edited Wikipedias) which makes cognates (may I?) hard to identify from the current pages. Thus "clicking a random sample of languages for the page on tennis" may be made more precise if one compares a foundational edit, or perhaps the historical edit that introduced the largest portion of text to a page, to the state of the English Wikipedia equivalent at that time. However, the example of Japanese tennis in 2004 compared to English is not very suggestive. 
>
>
>And I recall Elena Filatova did some pioneering work in computationally exploiting parallels and differences in multilingual Wikipedia.
>
>
>
>Cheers,
>
>
>Joel Nothman
>School of IT
>University of Sydney
>
>
>On 15 June 2014 11:58, Francis Bond <bond at ieee.org> wrote:
>
>>G'day.
>>
>>> No, articles from Wikipedia in different languages are NOT a comparable
>>> corpus, for many reasons
>>>
>>> First, most of the time they are a (more or less free) translation of a
>>> master/initial one.
>>
>>Do you have a citation for this?   As far as I know it is not
>>generally true, pages are written pretty much entirely independently
>>(at least for the English and Japanese Wikipedias which I am
>>experienced with).  I also clicked a random sample of languages for
>>the page on tennis, and they are all very differently structured.
>>
>>I seem to recall a shared task on aligning sentences in wikipedia
>>articles that found them not at all similar, but I am afraid I can't
>>find the paper: does anyone else recall it?
>>
>>--
>>Francis Bond <http://www3.ntu.edu.sg/home/fcbond/>
>>Division of Linguistics and Multilingual Studies
>>Nanyang Technological University
>> 
>>
>>_______________________________________________
>>UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
>>Corpora mailing list
>>Corpora at uib.no
>>http://mailman.uib.no/listinfo/corpora
>>