[Corpora-List] Need help with Twitter Corpus
Hristo Tanev
htanev at yahoo.co.uk
Tue Jun 19 15:39:31 UTC 2012
...only that Cyrillic is a script, not a language.
Hristo Tanev
________________________________
From: Benjamin Van Durme <vandurme at cs.jhu.edu>
To: Christine Amling <chamling at students.uni-mainz.de>
Cc: corpora at uib.no
Sent: Tuesday, 19 June 2012, 16:05
Subject: Re: [Corpora-List] Need help with Twitter Corpus
The following presents a new LID method, and includes a comparison
against a number of tools on Twitter data.
Language Identification for Creating Language-Specific Twitter Collections
Shane Bergsma, Paul McNamee, Mossaab Bagdouri, Clayton Fink, Theresa Wilson
http://aclweb.org/anthology-new/W/W12/W12-2108.pdf
Accuracy numbers (most of the other systems were run black-box, without
adaptation, so take these conservatively):
              Arabic  Devanagari  Cyrillic
TextCat       96.3    89.1        90.3
Google CLD    90.5    NA          91.4
Lui/Baldwin   91.4    78.4        88.8
PPM (new)     97.6    97.1        95.8
_______________________________________________
UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora