[Corpora-List] Seeking for a free comparable corpus

Sat Jun 14 19:08:33 UTC 2014

On 6/14/2014 10:31 AM, Diana Santos wrote:
> articles from Wikipedia in different languages are NOT a comparable
> corpus, for many reasons... most of the time they are a (more or
> less free) translation of a master/initial one.

Yes.  And I'd also like to add that Wikipedia articles about people,
places, and things in country X are usually more detailed in the
Wikipedia for the language of country X.

In fact, the Wikipedia editors often add a comment to an article
in the English Wikipedia that points to an article in language X
for more detail.  And they sometimes ask for volunteers to translate
some of that material and add it to the English article.

That suggests a challenging research problem:

  1. Develop tools and techniques for comparing Wikipedia articles
     on the same topic in different languages, L1 and L2.

  2. Find phrases, sentences, and paragraphs in the L1 and L2 articles
     that express the same or closely related information.

  3. Find and annotate conflicts or discrepancies between them.

  4. For any information in one article that is missing in the other,
     use a translator, human or machine, to generate a version in the
     other language.

Such tools could be useful for many purposes beyond updating and
extending the many editions of Wikipedia.  News services and
multinational organizations of any kind could benefit from them.

Steps #1, #2, and #3 would be just as useful for comparing articles
written in the same language.  For example, they could compare news
stories that summarize some discovery or innovation with more
technical articles on the same subject.
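
For the same-language case, a very rough sketch of steps #2 and #3 might
look like the following.  Everything in it is an assumption made for
illustration: the function names (tokens, jaccard, align, number_conflict),
the 0.5 threshold, the use of plain word overlap as the similarity measure,
and the toy sentences.  A real tool would need something much stronger than
word overlap, and the cross-language case would need translation or some
shared representation before any comparison is possible.

```python
import re

def tokens(sentence):
    """Lowercase word tokens of a sentence, as a set."""
    return set(re.findall(r"\w+", sentence.lower()))

def jaccard(a, b):
    """Jaccard overlap between two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def align(sents_a, sents_b, threshold=0.5):
    """Pair each sentence in sents_a with its best match in sents_b.

    Returns (pairs, missing).  pairs holds (i, j, score) for matches at or
    above the threshold; missing holds indices of sentences in sents_a
    that have no sufficiently similar sentence in sents_b.
    """
    toks_b = [tokens(s) for s in sents_b]
    pairs, missing = [], []
    for i, s in enumerate(sents_a):
        ta = tokens(s)
        scores = [jaccard(ta, tb) for tb in toks_b]
        j = max(range(len(scores)), key=scores.__getitem__, default=None)
        if j is not None and scores[j] >= threshold:
            pairs.append((i, j, scores[j]))
        else:
            missing.append(i)
    return pairs, missing

def number_conflict(sent_a, sent_b):
    """Crude discrepancy check: do the two sentences cite different numbers?"""
    return set(re.findall(r"\d+", sent_a)) != set(re.findall(r"\d+", sent_b))

# Toy data, invented for this example only.
article_a = [
    "Bergen is a city in Norway.",
    "It was founded in 1070.",
    "The city hosts a university.",
]
article_b = [
    "Bergen is a city in western Norway.",
    "It was founded in 1075.",
]

pairs, missing = align(article_a, article_b)
conflicts = [(i, j) for i, j, _ in pairs
             if number_conflict(article_a[i], article_b[j])]

print([(i, j) for i, j, _ in pairs])  # [(0, 0), (1, 1)]
print(conflicts)                      # [(1, 1)]
print(missing)                        # [2]
```

Here the aligned pairs correspond to step #2, the pair with differing
dates is a candidate discrepancy for step #3, and the unmatched sentence
is the kind of material step #4 would have to carry over.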

John
