[Corpora-List] Need help with Twitter Corpus
Taras
taras8055 at gmail.com
Tue Jun 19 16:07:53 UTC 2012
Yes, it looks like script detection rather than language detection. I
also wonder whether "Arabic" here includes other languages that use
scripts based on the Arabic one (say, Urdu, Persian, and some others).
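
To illustrate the distinction, here is a minimal sketch of script-level
detection, assuming Python 3 and only the standard library. The function
name dominant_script and the sample strings are hypothetical, chosen for
illustration. Taking the first word of each Unicode character name is a
rough heuristic standing in for the real Unicode Script property, which the
standard library does not expose. It is not the method of any tool compared
below.

```python
import unicodedata
from collections import Counter


def dominant_script(text):
    """Return the most frequent script label among the letters of `text`.

    Heuristic (an assumption of this sketch): the first word of a
    character's Unicode name, e.g. "ARABIC" in "ARABIC LETTER MEEM",
    is used as its script label.
    """
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # skip spaces, digits, punctuation, combining marks
        name = unicodedata.name(ch, "")
        if name:
            counts[name.split()[0]] += 1
    return counts.most_common(1)[0][0] if counts else None


# Hypothetical sample greetings; the keys are languages, not scripts.
samples = {
    "Arabic": "مرحبا بالعالم",
    "Persian": "سلام دنیا",
    "Urdu": "ہیلو دنیا",
    "Russian": "привет мир",
    "Ukrainian": "привіт світ",
    "Hindi": "नमस्ते दुनिया",
}

for language, text in samples.items():
    print(language, "->", dominant_script(text))
```

Under this heuristic, Arabic, Persian and Urdu all come out with the same
label, as do Russian and Ukrainian, so a script detector on its own cannot
tell those languages apart.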
Taras
On 19/06/12 16:39, Hristo Tanev wrote:
> ...only that Cyrillic is not a language.
>
> Hristo Tanev
>
> ------------------------------------------------------------------------
> *From:* Benjamin Van Durme <vandurme at cs.jhu.edu>
> *To:* Christine Amling <chamling at students.uni-mainz.de>
> *Cc:* corpora at uib.no
> *Sent:* Tuesday, 19 June 2012, 16:05
> *Subject:* Re: [Corpora-List] Need help with Twitter Corpus
>
> The following presents a new LID method, and includes a comparison
> against a number of tools on Twitter data.
>
> Language Identification for Creating Language-Specific Twitter Collections
> Shane Bergsma, Paul McNamee, Mossaab Bagdouri, Clayton Fink, Theresa
> Wilson
> http://aclweb.org/anthology-new/W/W12/W12-2108.pdf
>
> Accuracy numbers (most of the other systems were run as black boxes,
> without adaptation, so take these conservatively):
>
> System        Arabic   Devanagari   Cyrillic
> TextCat         96.3         89.1       90.3
> Google CLD      90.5           NA       91.4
> Lui/Baldwin     91.4         78.4       88.8
> PPM (new)       97.6         97.1       95.8
>
> _______________________________________________
> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
> Corpora mailing list
> Corpora at uib.no <mailto:Corpora at uib.no>
> http://mailman.uib.no/listinfo/corpora