[Corpora-List] Need help with Twitter Corpus
Paul McNamee
paul.mcnamee at jhuapl.edu
Tue Jun 19 15:55:52 UTC 2012
Neither is Devanagari. The numbers cited below are taken
from Table 4 of the referenced paper, and are 3-way classification
accuracies. Within each writing system (i.e., Arabic script, Devanagari
script, and Cyrillic script), there are three languages to choose
between. (See the paper for details.)
So for Arabic, the numbers refer to selection amongst Arabic, Farsi,
and Urdu. For Devanagari, between Hindi, Nepali, and Marathi. And for
Cyrillic, between Russian, Bulgarian, and Ukrainian tweets.
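To make the setup concrete, here is a rough sketch of the two pieces involved: picking the writing system of a tweet, and scoring a 3-way decision within that writing system. This is only an illustration, not the code used in the paper. The helper names (dominant_script, accuracy) are made up for this example, and detecting the script from Unicode character names is an assumption of the sketch.

```python
import unicodedata
from collections import Counter

# Candidate languages per writing system, as listed above.
CANDIDATES = {
    "ARABIC": ("Arabic", "Farsi", "Urdu"),
    "DEVANAGARI": ("Hindi", "Nepali", "Marathi"),
    "CYRILLIC": ("Russian", "Bulgarian", "Ukrainian"),
}


def dominant_script(text):
    """Return the most frequent of the three scripts among the letters of text.

    Hypothetical helper: the script is read off the Unicode character name,
    which is an assumption of this sketch. Returns None if no letter belongs
    to one of the three scripts.
    """
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, punctuation, spaces, combining marks
        name = unicodedata.name(ch, "")
        for script in CANDIDATES:
            if name.startswith(script):
                counts[script] += 1
    return counts.most_common(1)[0][0] if counts else None


def accuracy(gold, predicted):
    """Percentage of items whose predicted label equals the gold label."""
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and equal in length")
    correct = sum(g == p for g, p in zip(gold, predicted))
    return 100.0 * correct / len(gold)
```

A classifier for, say, Devanagari tweets would then only ever choose among CANDIDATES["DEVANAGARI"], and accuracy(gold, predicted) over those tweets gives a figure of the kind reported per column.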
Hope that clarifies this issue.
- Paul
On Tue, 19 Jun 2012, Hristo Tanev wrote:
> ....only that Cyrillic is not a language.
>
> Hristo Tanev
>
>
> ________________________________
> From: Benjamin Van Durme <vandurme at cs.jhu.edu>
> To: Christine Amling <chamling at students.uni-mainz.de>
> Cc: corpora at uib.no
> Sent: Tuesday, 19 June 2012, 16:05
> Subject: Re: [Corpora-List] Need help with Twitter Corpus
>
> The following presents a new LID method, and includes a comparison
> against a number of tools on Twitter data.
>
> Language Identification for Creating Language-Specific Twitter Collections
> Shane Bergsma, Paul McNamee, Mossaab Bagdouri, Clayton Fink, Theresa Wilson
> http://aclweb.org/anthology-new/W/W12/W12-2108.pdf
>
> Accuracy numbers (with most other systems run black-box without
> adaptation, so take these conservatively) :
>
>              Arabic  Devanagari  Cyrillic
> TextCat       96.3      89.1       90.3
> Google CLD    90.5       NA        91.4
> Lui/Baldwin   91.4      78.4       88.8
> PPM - (new)   97.6      97.1       95.8
>
> _______________________________________________
> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora