[Corpora-List] Need help with Twitter Corpus

Tue Jun 19 15:39:31 UTC 2012

...only that Cyrillic is a script, not a language.

Hristo Tanev

________________________________
From: Benjamin Van Durme <vandurme at cs.jhu.edu>
To: Christine Amling <chamling at students.uni-mainz.de>
Cc: corpora at uib.no
Sent: Tuesday, 19 June 2012, 16:05
Subject: Re: [Corpora-List] Need help with Twitter Corpus

The following presents a new LID method, and includes a comparison
against a number of tools on Twitter data.

Language Identification for Creating Language-Specific Twitter Collections
Shane Bergsma, Paul McNamee, Mossaab Bagdouri, Clayton Fink, Theresa Wilson
http://aclweb.org/anthology-new/W/W12/W12-2108.pdf

Accuracy numbers (most of the other systems were run black-box, without
adaptation, so read the comparison conservatively):

                 Arabic        Devanagari      Cyrillic
TextCat           96.3          89.1            90.3
Google CLD        90.5          NA              91.4
Lui/Baldwin       91.4          78.4            88.8
PPM (new)         97.6          97.1            95.8
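
The column headings in the table are scripts rather than languages; as far
as I can tell from the paper, each column covers several languages written
in that script. Below is a minimal sketch of sorting tweets into script
buckets before any per-language classifier is applied. It is not the
paper's method: the helper name dominant_script and the SCRIPTS subset are
assumptions made for illustration, and the only library used is Python's
standard unicodedata module.

```python
import unicodedata
from collections import Counter

# Illustrative subset of script names, matched against the start of each
# character's Unicode name (e.g. "CYRILLIC SMALL LETTER PE").
SCRIPTS = ("ARABIC", "DEVANAGARI", "CYRILLIC", "LATIN")


def dominant_script(text):
    """Return the most frequent script among the letters in text, or None.

    Hypothetical helper: it only buckets text by script and says nothing
    about which language within that script the text is written in.
    """
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, punctuation, whitespace, combining marks
        name = unicodedata.name(ch, "")
        for script in SCRIPTS:
            if name.startswith(script):
                counts[script] += 1
                break
    if not counts:
        return None
    return counts.most_common(1)[0][0]


if __name__ == "__main__":
    for sample in ("Привет, мир", "Здравей, свят", "नमस्ते", "مرحبا"):
        print(sample, "->", dominant_script(sample))
```

The Russian and Bulgarian samples both land in the CYRILLIC bucket, which
is the point of the objection above: script detection narrows the
candidates, and a language identifier still has to separate the languages
within each script.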

_______________________________________________
UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora