[Corpora-List] Need help with Twitter Corpus

Matthew Purver m.purver at qmul.ac.uk
Tue Jun 19 09:12:45 UTC 2012


You could also try the Compact Language Detector library, which is
pretty straightforward to use:

http://code.google.com/p/chromium-compact-language-detector/
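
For illustration, here is a minimal sketch of what automatic filtering
could look like. It does NOT use the Compact Language Detector API;
it is a toy, standard-library-only version of the character n-gram
profile comparison that language ID tools of this kind are commonly
built on. The function names, the two tiny training samples, and the
"en"/"de" labels are all assumptions made up for the example. A real
run on a Twitter corpus would need profiles trained on far more text,
or one of the actual libraries.

```python
from collections import Counter

def ngram_profile(text, n=3, top=300):
    """Return the text's character n-grams, most frequent first."""
    padded = " " + " ".join(text.lower().split()) + " "
    counts = Counter(padded[i:i + n] for i in range(len(padded) - n + 1))
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts, key=lambda g: (-counts[g], g))
    return ranked[:top]

def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams missing from the language
    profile get the maximum penalty (the profile length)."""
    rank = {g: i for i, g in enumerate(lang_profile)}
    max_penalty = len(lang_profile)
    return sum(abs(i - rank[g]) if g in rank else max_penalty
               for i, g in enumerate(doc_profile))

def guess_language(text, profiles):
    """Return the label of the profile closest to the text."""
    doc = ngram_profile(text)
    return min(profiles, key=lambda lang: out_of_place(doc, profiles[lang]))

# Hypothetical, deliberately tiny training samples (assumption: real
# profiles would be built from much larger monolingual text).
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and this is "
          "what they have with that from there",
    "de": "der schnelle braune fuchs springt durch den wald und das "
          "ist was sie mit dem von dort haben",
}
PROFILES = {lang: ngram_profile(s) for lang, s in SAMPLES.items()}

def keep_english(tweets, profiles=PROFILES):
    """Keep only the tweets whose closest profile is English."""
    return [t for t in tweets if guess_language(t, profiles) == "en"]
```

With one tweet per line in a txt file, something like
keep_english(open(path, encoding="utf-8").read().splitlines()) would
give the filtered list. Very short tweets are hard for any such method,
so a manual spot check of the output is still a good idea.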

On 19/06/2012 9:57, Diana Maynard wrote:
> Hi Christine
> There are a bunch of Language ID tools you can use for this, e.g.
> TextCat http://textcat.sourceforge.net/TextCat
> Regards
> Diana
>
> On 19/06/2012 09:37, Christine Amling wrote:
>> Hello everybody, I am in need of a computational linguist. I have a
>> 700,000-word Twitter corpus that was assembled automatically by streaming
>> the public Twitter statuses and then filtered for English-only tweets
>> using the language code the users themselves claimed to use. The problem
>> is that ca. 50% of the corpus consists of non-English tweets that claim
>> to be English, so any statistics I compute on that corpus will inevitably
>> be flawed. Is there another way to filter out the non-English tweets
>> automatically? I don't have the time to do it manually; it's
>> impossible.
>> If anybody wants to help me, the corpora can be downloaded here
>> <https://dl.dropbox.com/u/9674510/christine2.txt> and here
>> <https://dl.dropbox.com/u/9674510/neue_daten.txt>; they are simple txt
>> files. I would greatly appreciate that. Thanks in advance.
>> Christine
>
> _______________________________________________
> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora

-- 
Matthew Purver - http://www.eecs.qmul.ac.uk/~mpurver/

Lecturer / Postgraduate Admissions Tutor

Interaction, Media and Communication
School of Electronic Engineering and Computer Science
Queen Mary, University of London, London E1 4NS, UK
