[Corpora-List] Fwd: Seeking for a free comparable corpus

Diana Santos dianamsmpsantos at gmail.com
Sun Jun 15 11:14:11 UTC 2014


Hi Francis,
No, I don't have a citation for that, if by citation you mean an empirical
study that actually measures it.
If you press me, I should say that my experience is based on
studying the Portuguese Wikipedia, and that my main interest was the
cultural domains (not hard sciences or sports). My impression as a user
is that most pages that have equivalents in other languages (so not most
pages, but most "parallel" pages) in the Portuguese/English or
Norwegian/English pairs have been translated, in one direction or the
other. But this is a subjective impression as a user.

If you mean citations of papers that discuss some of these subjects or look
at Wikipedia crosslinguistically, I can offer some:
Mota et al. 2012. "Págico: Evaluating Wikipedia-based information retrieval
in Portuguese".
http://www.lrec-conf.org/proceedings/lrec2012/pdf/590_Paper.pdf
Santos et al. 2012. Volume of the Linguamática journal dedicated to Págico
(in Portuguese). http://linguamatica.com/index.php/linguamatica/issue/view/8
Santos et al. 2010. GikiCLEF: Crosscultural issues in multilingual
information access.
http://www.lrec-conf.org/proceedings/lrec2010/pdf/272_Paper.pdf

As to the alignment of Wikipedia articles, I do remember a paper on that at
LREC 2012 (or LREC 2010?) co-authored by Rob Gaizauskas, who, as far as I
remember, was involved in an EU project that touched upon that.

Diana



2014-06-15 3:58 GMT+02:00 Francis Bond <bond at ieee.org>:

> G'day.
>
> > No, articles from Wikipedia in different languages are NOT a comparable
> > corpus, for many reasons
> >
> > First, most of the time they are a (more or less free) translation of a
> > master/initial one.
>
> Do you have a citation for this? As far as I know it is not
> generally true: pages are written pretty much entirely independently
> (at least for the English and Japanese Wikipedias, which I am
> experienced with). I also clicked a random sample of languages for
> the page on tennis, and they are all very differently structured.
>
> I seem to recall a shared task on aligning sentences in Wikipedia
> articles that found them not at all similar, but I am afraid I can't
> find the paper: does anyone else recall it?
>
> --
> Francis Bond <http://www3.ntu.edu.sg/home/fcbond/>
> Division of Linguistics and Multilingual Studies
> Nanyang Technological University
>