[Corpora-List] Need help with Twitter Corpus
McNamee, Paul
Paul.McNamee at jhuapl.edu
Tue Jun 19 17:02:38 UTC 2012
[Resending, since my first response didn't show up. Apologies if this
ends up being a duplicate.]
The numbers below were extracted from Table 4 of the referenced paper,
and are 3-way classification accuracy. Within each writing system
(i.e., Arabic script, Devanagari script, and Cyrillic script), there
are three languages to choose between. The paper has the details.
So for Arabic, the numbers refer to selection amongst Arabic, Farsi,
and Urdu. For Devanagari, between Hindi, Nepali, and Marathi. And for
Cyrillic, between Russian, Bulgarian, and Ukrainian tweets.
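To make the setup concrete, here is a rough sketch of the two-step framing: first determine the writing system, then restrict the decision to the three languages listed for that script. This is not the method from the paper; the names CANDIDATES, dominant_script, and candidate_languages are made up for illustration, and the script check relies on Unicode character names, which is only an approximation.

```python
import unicodedata
from collections import Counter

# Candidate languages per writing system, as listed in this message.
# The dictionary keys match the prefixes of Unicode character names.
CANDIDATES = {
    "ARABIC": ["Arabic", "Farsi", "Urdu"],
    "DEVANAGARI": ["Hindi", "Nepali", "Marathi"],
    "CYRILLIC": ["Russian", "Bulgarian", "Ukrainian"],
}

def dominant_script(text):
    """Return the most frequent of the three scripts in text, or None."""
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, punctuation, whitespace, combining marks
        name = unicodedata.name(ch, "")
        for script in CANDIDATES:
            if name.startswith(script):
                counts[script] += 1
                break
    return counts.most_common(1)[0][0] if counts else None

def candidate_languages(text):
    """Return the three languages a classifier would choose among."""
    return CANDIDATES.get(dominant_script(text), [])

if __name__ == "__main__":
    print(candidate_languages("Привет, мир"))
```

Script detection only narrows the field; the 3-way accuracy figures concern the harder second step of choosing among the remaining languages.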
Hope that clarifies this issue.
- Paul
________________________________________
From: corpora-bounces at uib.no [corpora-bounces at uib.no] On Behalf Of Taras [taras8055 at gmail.com]
Sent: Tuesday, June 19, 2012 12:07 PM
To: corpora at uib.no
Subject: Re: [Corpora-List] Need help with Twitter Corpus
Yes, it seems like script detection rather than language detection. I also wonder whether the notion of Arabic here includes languages that use scripts based on it (say, Urdu, Persian, and some others).
Taras
On 19/06/12 16:39, Hristo Tanev wrote:
....only that Cyrillic is not a language.
Hristo Tanev
________________________________
From: Benjamin Van Durme <vandurme at cs.jhu.edu>
To: Christine Amling <chamling at students.uni-mainz.de>
Cc: corpora at uib.no
Sent: Tuesday, 19 June 2012, 16:05
Subject: Re: [Corpora-List] Need help with Twitter Corpus
The following presents a new LID method, and includes a comparison
against a number of tools on Twitter data.
Language Identification for Creating Language-Specific Twitter Collections
Shane Bergsma, Paul McNamee, Mossaab Bagdouri, Clayton Fink, Theresa Wilson
http://aclweb.org/anthology-new/W/W12/W12-2108.pdf
Accuracy numbers (most other systems were run black-box, without
adaptation, so take these conservatively):
              Arabic   Devanagari   Cyrillic
TextCat        96.3       89.1        90.3
Google CLD     90.5        NA         91.4
Lui/Baldwin    91.4       78.4        88.8
PPM (new)      97.6       97.1        95.8
_______________________________________________
UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora