[Corpora-List] Need help with Twitter Corpus

Leon Derczynski leon at dcs.shef.ac.uk
Tue Jun 19 09:29:59 UTC 2012


Hello Christine,

You could also filter tweets by country of origin (see
http://en.wikipedia.org/wiki/List_of_countries_where_English_is_an_official_language)
or - even better - do some language detection on the tweets. Textcat
is an established language detection tool, working in Java, and
researchers in Amsterdam have kindly supplied Textcat models for
Twitter language detection that can distinguish English from four
other languages.

 http://ilps.science.uva.nl/resources/twitterlid

If that's no good, recent work on 90-something-way language ID for
tweets also reaches state-of-the-art performance (from Lui & Baldwin):

 https://github.com/saffsd/langid.py
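
As a rough sketch of how the filtering could look with langid.py: the
code below assumes one tweet per line in your txt files (I have not
checked their layout) and keeps only the lines whose top predicted
language is "en". The function names filter_english and
load_langid_classifier are made up for illustration; the only library
call used is langid.classify(text), which returns a (language code,
score) pair.

```python
from typing import Callable, Iterable, List, Tuple

# A classifier maps a text to (language code, score), like langid.classify.
Classifier = Callable[[str], Tuple[str, float]]


def load_langid_classifier() -> Classifier:
    """Return langid.classify; assumes langid is installed (pip install langid)."""
    import langid  # third-party, imported lazily so the filter itself has no dependency
    return langid.classify


def filter_english(tweets: Iterable[str], classify: Classifier) -> List[str]:
    """Keep only the tweets whose top predicted language code is 'en'."""
    kept = []
    for tweet in tweets:
        text = tweet.strip()
        if not text:
            continue  # skip blank lines
        lang, _score = classify(text)
        if lang == "en":
            kept.append(text)
    return kept
```

To run it over a file you could pass the open file handle as the
tweets argument, e.g. filter_english(open("corpus.txt",
encoding="utf-8"), load_langid_classifier()), where corpus.txt is a
placeholder name. Very short tweets are hard for any language
identifier, so it is worth spot-checking a sample of what gets
dropped.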

Hope this helps

All the best,


Leon


On 19 June 2012 09:37, Christine Amling <chamling at students.uni-mainz.de> wrote:
> Hello everybody, I am in need of a computational linguist. I have a
> 700,000-word Twitter corpus that was assembled automatically by streaming
> the public Twitter statuses and then filtered for English-only tweets
> using the language code the users themselves claimed to use. The problem
> is that ca. 50% of the corpus consists of non-English tweets that claim
> to be English, so any statistics I compute on that corpus will
> inevitably be flawed. Is there another way to filter out the non-English
> tweets automatically? I don't have the time to do it manually; it's
> impossible.
> If anybody wants to help me, the corpora can be downloaded here and here;
> they are simple txt files. I would greatly appreciate that. Thanks in
> advance.
> Christine
>
> _______________________________________________
> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
>



-- 
Leon R A Derczynski
NLP Research Group

Department of Computer Science
University of Sheffield
Regent Court, 211 Portobello
Sheffield S1 4DP, UK

+44 114 22 21931
http://www.dcs.shef.ac.uk/~leon/
