[Corpora-List] Need help with Twitter Corpus

Christine Amling chamling at students.uni-mainz.de
Tue Jun 19 08:37:58 UTC 2012


Hello everybody, I am in need of a computational linguist. I have a 
700,000-word Twitter corpus that was assembled automatically by streaming 
the public Twitter statuses and then filtered for English-only tweets 
using the language code the users themselves claimed to use. The problem 
is that roughly 50% of the corpus consists of non-English tweets that 
claim to be English, so any statistics computed on this corpus will 
inevitably be flawed. Is there a way to filter out the non-English tweets 
automatically? I don't have the time to do it manually; it's impossible.
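(One possible direction, sketched below: off-the-shelf language identifiers such as langid.py or character n-gram models are the robust solution, but even a minimal heuristic based on the ratio of common English function words can discard many mislabelled tweets. The stopword list, threshold value, and function names below are illustrative assumptions, not part of any existing tool.)

```python
# Minimal English-vs-other heuristic: score each tweet by the fraction
# of its tokens that are common English function words, and keep only
# tweets above a threshold. Real language identifiers are far more
# accurate; this only sketches the idea on plain text files.
import re

# Small, illustrative stopword list (an assumption, not exhaustive).
ENGLISH_STOPWORDS = {
    "the", "a", "an", "and", "or", "but", "is", "are", "was", "were",
    "i", "you", "he", "she", "it", "we", "they", "to", "of", "in",
    "on", "for", "with", "at", "this", "that", "not", "my", "your",
}

def english_score(tweet: str) -> float:
    """Fraction of alphabetic tokens that are common English stopwords."""
    tokens = re.findall(r"[a-z']+", tweet.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t in ENGLISH_STOPWORDS)
    return hits / len(tokens)

def filter_english(tweets, threshold=0.15):
    """Keep tweets whose stopword ratio meets the threshold."""
    return [t for t in tweets if english_score(t) >= threshold]
```

Applied line by line to a txt corpus, this would drop tweets with no recognizable English function words; the threshold would need tuning against a hand-checked sample.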
If anybody wants to help me, the corpora can be downloaded here 
<https://dl.dropbox.com/u/9674510/christine2.txt> and here 
<https://dl.dropbox.com/u/9674510/neue_daten.txt>; they are plain txt 
files. I would greatly appreciate it. Thanks in advance.
Christine