[Corpora-List] Within-language language ID

Cyril Grouin cyril.grouin at limsi.fr
Sat Jan 15 11:43:04 UTC 2011


Hi Philip,

In 2010, the DEFT French text-mining challenge asked participants to identify the country in which a newspaper article was published (task 2).

The corpus was composed of newspaper articles in French from two countries (France vs. Canada), extracted from four newspapers (Le Monde and L'Est Républicain from France, La Presse and Le Devoir from Québec). Two categories of articles were used: general news vs. sports. Participants had to identify the country of publication of each article and then the newspaper from which it was extracted.
- The training corpus was composed of 3719 articles (1728 from France, 1991 from Québec; 1820 sports, 1899 general);
- The test corpus was composed of 2482 articles (1153 from France, 1329 from Québec; 1216 sports, 1266 general).
An extract from the training corpus is available here: http://www.groupes.polymtl.ca/taln2010/corpus_origine.html
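
As a rough illustration of the country-identification task, and of the character n-gram approach raised in the quoted question below, here is a minimal Python sketch. It is not the method of any DEFT participant: the four training sentences, the labels, and the function names (char_ngrams, train, classify) are all invented for illustration, and the scoring is a plain add-one-smoothed naive Bayes over character trigrams with uniform class priors.

```python
import math
from collections import Counter, defaultdict

def char_ngrams(text, n=3):
    """Return the character n-grams of a lowercased, space-padded text."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def train(samples, n=3):
    """Count character n-grams per label from (text, label) pairs."""
    counts = defaultdict(Counter)
    for text, label in samples:
        counts[label].update(char_ngrams(text, n))
    return counts

def classify(text, counts, n=3):
    """Pick the label with the highest add-one-smoothed log-likelihood."""
    vocab = set().union(*counts.values())
    best_label, best_score = None, -math.inf
    for label, grams in counts.items():
        total = sum(grams.values())
        score = sum(
            math.log((grams[g] + 1) / (total + len(vocab) + 1))
            for g in char_ngrams(text, n)
        )
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy training data, invented for this sketch (NOT taken from the DEFT corpus).
samples = [
    ("Le Canadien de Montréal a gagné son match de hockey", "Québec"),
    ("Les Alouettes ont perdu au stade Percival-Molson", "Québec"),
    ("Le Paris Saint-Germain a gagné son match de football", "France"),
    ("Le Stade Toulousain domine le championnat de rugby", "France"),
]
counts = train(samples)
predicted = classify("un match de hockey à Montréal", counts)
```

With only four sentences the model mostly memorises sport and place names, which is the kind of geographic cue discussed in the next paragraph; a realistic experiment would use thousands of articles, as in the corpus described above.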

We tested the following hypothesis: is the country easier to identify for sports articles, given the geographic association of some sports (football and rugby in France vs. hockey and baseball in Québec)? Results were better for articles dealing with one of these four sports than for the other articles.

Proceedings (in French) are available at: http://deft10.limsi.fr/actes_deft.php
Our presentation is here: http://deft10.limsi.fr/actes/deft10_presentation_atelier.pdf

Best regards,
Cyril.


On 14 Jan 2011, at 16:21, P Resnik wrote:

> I'm wondering if anyone can point me to practical results on language sub-classification, e.g. Spanish (Latin America vs. U.S. vs. Spain), French (Canada vs. France vs. Belgium vs. ...), etc.   What training set sizes are needed for decent performance using standard character n-gram sorts of approaches?  Do those approaches, which work well for language ID in general, break down badly once you're working within a single language?   I'd be very happy to receive practical comments, refs to the literature, or both.  I'm also happy to take replies privately and then summarize to the list if there's interest.  
> 
> Thanks!
> 
>   Philip
> 
> _______________________________________________
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora

