[Corpora-List] Need help with Twitter Corpus

McNamee, Paul Paul.McNamee at jhuapl.edu
Tue Jun 19 17:02:38 UTC 2012


[Resending, since my first response didn't show up.  Apologies if this
ends up being a duplicate.]

The numbers below were extracted from Table 4 of the referenced paper,
and are 3-way classification accuracy.  Within each writing system
(i.e., Arabic script, Devanagari script, and Cyrillic script), there
are three languages to choose between.  The paper has the details.

So for Arabic, the numbers refer to selection amongst Arabic, Farsi,
and Urdu. For Devanagari, between Hindi, Nepali, and Marathi. And for
Cyrillic, between Russian, Bulgarian, and Ukrainian tweets.
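
As a rough illustration only (this is not the method from the paper),
the evaluation setup can be sketched in Python: the writing system
narrows each tweet to three candidate languages, and accuracy is then
the percentage of tweets assigned the correct one of the three.  The
function names below are hypothetical, and guessing the script from
Unicode character names is an assumption made for the sketch.

```python
import unicodedata
from collections import Counter

# Candidate languages per writing system, as listed in this message.
CANDIDATES = {
    "ARABIC": ["Arabic", "Farsi", "Urdu"],
    "DEVANAGARI": ["Hindi", "Nepali", "Marathi"],
    "CYRILLIC": ["Russian", "Bulgarian", "Ukrainian"],
}


def dominant_script(text):
    """Return the most frequent known script among the letters, or None."""
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue
        # Unicode names start with the script, e.g. "CYRILLIC SMALL LETTER PE".
        name = unicodedata.name(ch, "")
        script = name.split(" ")[0] if name else ""
        if script in CANDIDATES:
            counts[script] += 1
    return counts.most_common(1)[0][0] if counts else None


def candidate_languages(text):
    """Script detection only narrows the choice to three languages."""
    return CANDIDATES.get(dominant_script(text), [])


def accuracy(gold, predicted):
    """Percentage of items whose predicted language matches the gold label."""
    correct = sum(1 for g, p in zip(gold, predicted) if g == p)
    return 100.0 * correct / len(gold)
```

Knowing the script only gets you as far as candidate_languages();
picking one language out of the three is the part the table measures.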

Hope that clarifies this issue.

- Paul

________________________________________
From: corpora-bounces at uib.no [corpora-bounces at uib.no] On Behalf Of Taras [taras8055 at gmail.com]
Sent: Tuesday, June 19, 2012 12:07 PM
To: corpora at uib.no
Subject: Re: [Corpora-List] Need help with Twitter Corpus

Yes, it seems like script detection rather than language detection. I also wonder whether the notion of Arabic here includes languages that use scripts based on it (say, Urdu, Persian, and some others).

Taras


On 19/06/12 16:39, Hristo Tanev wrote:
....only that Cyrillic is not a language.

Hristo Tanev

________________________________
From: Benjamin Van Durme <vandurme at cs.jhu.edu>
To: Christine Amling <chamling at students.uni-mainz.de>
Cc: corpora at uib.no
Sent: Tuesday, 19 June 2012, 16:05
Subject: Re: [Corpora-List] Need help with Twitter Corpus

The following presents a new LID method, and includes a comparison
against a number of tools on Twitter data.

Language Identification for Creating Language-Specific Twitter Collections
Shane Bergsma, Paul McNamee, Mossaab Bagdouri, Clayton Fink, Theresa Wilson
http://aclweb.org/anthology-new/W/W12/W12-2108.pdf

Accuracy numbers (with most other systems run black-box without
adaptation, so take these conservatively):

                 Arabic        Devanagari      Cyrillic
TextCat          96.3          89.1            90.3
Google CLD       90.5          NA              91.4
Lui/Baldwin      91.4          78.4            88.8
PPM - (new)      97.6          97.1            95.8

_______________________________________________
UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora








