[Corpora-List] Need help with Twitter Corpus

Benjamin Van Durme vandurme at cs.jhu.edu
Tue Jun 19 14:05:51 UTC 2012


The following paper presents a new language identification (LID) method
and includes a comparison against a number of existing tools on Twitter data.

Language Identification for Creating Language-Specific Twitter Collections
Shane Bergsma, Paul McNamee, Mossaab Bagdouri, Clayton Fink, Theresa Wilson
http://aclweb.org/anthology-new/W/W12/W12-2108.pdf

Accuracy numbers (most of the other systems were run as black boxes,
without adaptation, so take these conservatively):

                 Arabic        Devanagari      Cyrillic
TextCat          96.3          89.1            90.3
Google CLD       90.5          NA              91.4
Lui/Baldwin      91.4          78.4            88.8
PPM (new)        97.6          97.1            95.8
