<html>
<head>
<meta content="text/html; charset=ISO-8859-1"
http-equiv="Content-Type">
</head>
<body bgcolor="#FFFFFF" text="#000000">
Hello everybody, I am in need of a computational linguist. I have a
Twitter corpus of about 700,000 words that was assembled automatically
by streaming the public Twitter statuses and then restricted to
English-only tweets based on the language code the users themselves
claim to use. The problem is that roughly 50% of the corpus consists
of non-English tweets that are nevertheless marked as English, so any
statistics computed on this corpus will inevitably be flawed. Is there
a way to filter out the non-English tweets automatically? I don't have
the time to do it manually; with this many tweets it would be
impossible.<br>
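In case it helps anyone thinking about the problem, here is a rough sketch of one automatic approach: a stopword-based heuristic that keeps a tweet only if enough of its tokens are very frequent English function words. The word list, threshold, and one-tweet-per-line assumption are my own choices for illustration; a trained language identifier would of course be more accurate.<br>

```python
import re

# Small set of very frequent English function words (illustrative only).
EN_STOPWORDS = {
    "the", "and", "to", "of", "a", "in", "is", "it", "you", "that",
    "for", "on", "with", "was", "are", "this", "not", "have", "my",
}

def looks_english(tweet, threshold=0.15):
    """Return True if enough tokens are common English stopwords."""
    tokens = re.findall(r"[a-z']+", tweet.lower())
    if not tokens:
        return False
    hits = sum(1 for t in tokens if t in EN_STOPWORDS)
    return hits / len(tokens) >= threshold

def filter_corpus(lines):
    """Keep only the lines (tweets) that pass the English heuristic."""
    return [line for line in lines if looks_english(line)]
```

The threshold would need tuning on a hand-checked sample, and very short tweets with no function words will be dropped, so treat this only as a starting point.<br>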
If anybody wants to help me: the corpora can be downloaded <a
href="https://dl.dropbox.com/u/9674510/christine2.txt">here</a>
and <a href="https://dl.dropbox.com/u/9674510/neue_daten.txt">here</a>;
they are plain .txt files. I would greatly appreciate any help. Thanks
in advance.<br>
Christine<br>
</body>
</html>