<html>
<head>
<meta content="text/html; charset=ISO-8859-1"
http-equiv="Content-Type">
</head>
<body bgcolor="#FFFFFF" text="#000000">
Hello everybody, I am in need of a computational linguist. I have a
Twitter corpus of about 700,000 words that was assembled automatically
by streaming the public Twitter statuses and then restricted to
English-only tweets based on the language code the users themselves
claim to use. The problem is that roughly 50% of the corpus consists
of non-English tweets that are nevertheless marked as English, so any
statistics computed on this corpus will inevitably be flawed. Is there
a way to filter out the non-English tweets automatically? I don't have
the time to do it manually; with this many tweets it would be
impossible.<br>
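In case it helps anyone thinking about the problem, here is a rough sketch of one automatic approach: a stopword-based heuristic that keeps a tweet only if enough of its tokens are very frequent English function words. The word list, threshold, and one-tweet-per-line assumption are my own choices for illustration; a trained language identifier would of course be more accurate.<br>

```python
import re

# Small set of very frequent English function words (illustrative only).
EN_STOPWORDS = {
    "the", "and", "to", "of", "a", "in", "is", "it", "you", "that",
    "for", "on", "with", "was", "are", "this", "not", "have", "my",
}

def looks_english(tweet, threshold=0.15):
    """Return True if enough tokens are common English stopwords."""
    tokens = re.findall(r"[a-z']+", tweet.lower())
    if not tokens:
        return False
    hits = sum(1 for t in tokens if t in EN_STOPWORDS)
    return hits / len(tokens) >= threshold

def filter_corpus(lines):
    """Keep only the lines (tweets) that pass the English heuristic."""
    return [line for line in lines if looks_english(line)]
```

The threshold would need tuning on a hand-checked sample, and very short tweets with no function words will be dropped, so treat this only as a starting point.<br>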
If anybody wants to help me: the corpora can be downloaded <a
href="https://dl.dropbox.com/u/9674510/christine2.txt">here</a>
and <a href="https://dl.dropbox.com/u/9674510/neue_daten.txt">here</a>;
they are plain .txt files. I would greatly appreciate any help. Thanks
in advance.<br>
Christine<br>
</body>
</html>