[Corpora-List] Need help with Twitter Corpus

Taras taras8055 at gmail.com
Tue Jun 19 16:07:53 UTC 2012


Yes, it looks like script detection rather than language detection. I 
also wonder whether "Arabic" here includes other languages that use 
scripts based on the Arabic one (say, Urdu, Persian and some others).
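
To illustrate the distinction, here is a minimal sketch of script detection
in Python. It is not the method from the paper quoted below; it only uses the
first word of each letter's Unicode character name as a rough proxy for its
script. The function name dominant_script is hypothetical.

```python
import unicodedata
from collections import Counter


def dominant_script(text):
    """Return the most frequent script label among the letters of text.

    The label is the first word of the Unicode character name
    (e.g. 'ARABIC', 'CYRILLIC', 'DEVANAGARI', 'LATIN'). This is a rough
    heuristic for illustration, not a full script-property lookup.
    """
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, punctuation, spaces, combining marks
        name = unicodedata.name(ch, "")
        if name:
            counts[name.split()[0]] += 1
    if not counts:
        return None
    return counts.most_common(1)[0][0]


if __name__ == "__main__":
    samples = {
        "Arabic": "مرحبا",
        "Persian": "سلام",
        "Urdu": "اردو",
        "Russian": "привет",
        "Bulgarian": "здравей",
    }
    for language, word in samples.items():
        print(language, dominant_script(word))
```

With a detector of this kind, the Arabic, Persian and Urdu samples all get
the same label, as do the Russian and Bulgarian ones, so telling the
languages apart needs something beyond the script.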

Taras


On 19/06/12 16:39, Hristo Tanev wrote:
> ...only that Cyrillic is not a language.
>
> Hristo Tanev
>
> ------------------------------------------------------------------------
> *From:* Benjamin Van Durme <vandurme at cs.jhu.edu>
> *To:* Christine Amling <chamling at students.uni-mainz.de>
> *Cc:* corpora at uib.no
> *Sent:* Tuesday, 19 June 2012, 16:05
> *Subject:* Re: [Corpora-List] Need help with Twitter Corpus
>
> The following presents a new LID method, and includes a comparison
> against a number of tools on Twitter data.
>
> Language Identification for Creating Language-Specific Twitter Collections
> Shane Bergsma, Paul McNamee, Mossaab Bagdouri, Clayton Fink, Theresa 
> Wilson
> http://aclweb.org/anthology-new/W/W12/W12-2108.pdf
>
> Accuracy numbers (most of the other systems were run black-box, without
> adaptation, so take these conservatively):
>
>                 Arabic        Devanagari      Cyrillic
> TextCat         96.3          89.1            90.3
> Google CLD      90.5          NA              91.4
> Lui/Baldwin     91.4          78.4            88.8
> PPM (new)       97.6          97.1            95.8
>
> _______________________________________________
> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
