[Corpora-List] Need help with Twitter Corpus

Paul McNamee paul.mcnamee at jhuapl.edu
Tue Jun 19 15:55:52 UTC 2012


Neither is Devanagari.  The numbers cited below are taken from Table 4
of the referenced paper, and are 3-way classification accuracies.
Within each writing system (i.e., Arabic script, Devanagari script,
and Cyrillic script), there are three languages to choose between.
(See the paper for details.)

So for Arabic, the numbers refer to selection amongst Arabic, Farsi,
and Urdu. For Devanagari, between Hindi, Nepali, and Marathi. And for
Cyrillic, between Russian, Bulgarian, and Ukrainian tweets.
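
To make the setup concrete, here is a rough sketch of the evaluation
as described above: group tweets by writing system, then score a
classifier that only has to pick among the three languages for that
script. This is not the code from the paper. The names
dominant_script, three_way_accuracy, and the classify callback are
hypothetical, and detecting the script from Unicode character names
is an assumption made for illustration.

```python
import unicodedata
from collections import Counter

# Candidate languages per writing system, as listed in this message.
CANDIDATES = {
    "ARABIC": ("Arabic", "Farsi", "Urdu"),
    "DEVANAGARI": ("Hindi", "Nepali", "Marathi"),
    "CYRILLIC": ("Russian", "Bulgarian", "Ukrainian"),
}


def dominant_script(text):
    """Return the most frequent known script among the letters of text, or None."""
    counts = Counter()
    for ch in text:
        if ch.isalpha():
            # Unicode names begin with the script, e.g. "CYRILLIC SMALL LETTER PE".
            script = unicodedata.name(ch, "").split(" ")[0]
            if script in CANDIDATES:
                counts[script] += 1
    return counts.most_common(1)[0][0] if counts else None


def three_way_accuracy(examples, classify):
    """Per-script accuracy (in percent) of a 3-way language classifier.

    examples: iterable of (text, gold_language) pairs.
    classify: hypothetical callback, classify(text, candidates) -> language.
    """
    correct, total = Counter(), Counter()
    for text, gold in examples:
        script = dominant_script(text)
        if script is None:
            continue  # no letters from the three scripts; skip the tweet
        total[script] += 1
        if classify(text, CANDIDATES[script]) == gold:
            correct[script] += 1
    return {s: 100.0 * correct[s] / total[s] for s in total}
```

Any of the compared tools could be wrapped as the classify callback,
provided its output is restricted to the three candidates.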

Hope that clarifies this issue.

- Paul


On Tue, 19 Jun 2012, Hristo Tanev wrote:

> ....only that Cyrillic is not a language.
>
> Hristo Tanev
>
>
> ________________________________
> From: Benjamin Van Durme <vandurme at cs.jhu.edu>
> To: Christine Amling <chamling at students.uni-mainz.de> 
> Cc: corpora at uib.no 
> Sent: Tuesday, 19 June 2012, 16:05
> Subject: Re: [Corpora-List] Need help with Twitter Corpus
> 
> The following presents a new LID method, and includes a comparison
> against a number of tools on Twitter data.
>
> Language Identification for Creating Language-Specific Twitter Collections
> Shane Bergsma, Paul McNamee, Mossaab Bagdouri, Clayton Fink, Theresa Wilson
> http://aclweb.org/anthology-new/W/W12/W12-2108.pdf
>
> Accuracy numbers (with most other systems run black-box without
> adaptation, so take these conservatively) :
>
>                  Arabic        Devanagari      Cyrillic
> TextCat           96.3          89.1            90.3
> Google CLD        90.5          NA              91.4
> Lui/Baldwin       91.4          78.4            88.8
> PPM - (new)       97.6          97.1            95.8
>
> _______________________________________________
> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora