[Corpora-List] Need help with Twitter Corpus

Diana Maynard d.maynard at dcs.shef.ac.uk
Tue Jun 19 08:57:50 UTC 2012


Hi Christine
There are a number of language ID tools you can use for this, e.g. 
TextCat: http://textcat.sourceforge.net/TextCat
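For example, here is a minimal filtering sketch in Python, assuming the 
langid.py package rather than TextCat (the file names are just placeholders 
for your txt files, and the snippet is untested on your data):

import langid  # pip install langid; any other language ID tool would work similarly

# Keep only the tweets (one per line) that langid identifies as English.
with open("christine2.txt", encoding="utf-8") as infile, \
        open("christine2_english.txt", "w", encoding="utf-8") as outfile:
    for line in infile:
        tweet = line.strip()
        if not tweet:
            continue
        lang, score = langid.classify(tweet)  # returns (language_code, score)
        if lang == "en":
            outfile.write(tweet + "\n")

Bear in mind that very short tweets are hard for any language identifier, 
so expect some residual noise whichever tool you use.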
Regards
Diana

On 19/06/2012 09:37, Christine Amling wrote:
> Hello everybody, I am in need of a computational linguist. I have a
> 700,000-word Twitter corpus that was assembled automatically by streaming
> the public Twitter statuses and then filtered for English-only tweets
> based on the language code the users themselves claimed to use. The
> problem is that roughly 50% of the corpus consists of non-English tweets
> that claim to be English, so any statistics computed over the corpus will
> inevitably be skewed. Is there another way to filter out the non-English
> tweets automatically? I don't have the time to do it manually; it's
> impossible.
> If anybody wants to help me: the corpora can be downloaded here
> <https://dl.dropbox.com/u/9674510/christine2.txt> and here
> <https://dl.dropbox.com/u/9674510/neue_daten.txt>; they are plain txt
> files. I would greatly appreciate any help. Thanks in advance.
> Christine
