[Corpora-List] Need help with Twitter Corpus
Benjamin Van Durme
vandurme at cs.jhu.edu
Tue Jun 19 14:05:51 UTC 2012
The following paper presents a new language identification (LID) method and
includes a comparison against a number of existing tools on Twitter data.
Language Identification for Creating Language-Specific Twitter Collections
Shane Bergsma, Paul McNamee, Mossaab Bagdouri, Clayton Fink, Theresa Wilson
http://aclweb.org/anthology-new/W/W12/W12-2108.pdf
Accuracy numbers (most of the other systems were run as black boxes, without
adaptation, so treat these conservatively):
System        Arabic  Devanagari  Cyrillic
TextCat         96.3        89.1      90.3
Google CLD      90.5          NA      91.4
Lui/Baldwin     91.4        78.4      88.8
PPM (new)       97.6        97.1      95.8