[Corpora-List] Corpora for language identification training?

Thu Apr 19 09:24:50 UTC 2007

Hi Dean,

Serge Sharoff at Leeds has collected comparable 100-million-word
web-as-corpus corpora for several languages; see
http://corpus.leeds.ac.uk/internet.html

You can't download the text corpora directly, since each web file can
only be cached locally to avoid copyright infringement; but you CAN
download the list of URLs and then run a program to re-create the
corpora yourself.
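
A minimal sketch of that re-creation step, assuming the URL list is a
plain text file with one URL per line. The names fetch_url, strip_tags
and rebuild_corpus are hypothetical, and this is not the program
offered on the Leeds site; the regex-based tag stripping is only a
rough stand-in for proper HTML cleaning.

```python
import re
import urllib.request
from pathlib import Path
from typing import Callable, Iterable


def fetch_url(url: str, timeout: float = 10.0) -> str:
    """Download one page and decode it (falls back to UTF-8)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


def strip_tags(html: str) -> str:
    """Crude HTML-to-text: drop script/style blocks, tags, extra spaces."""
    html = re.sub(r"(?is)<(script|style).*?</\1>", " ", html)
    text = re.sub(r"(?s)<[^>]+>", " ", html)
    return re.sub(r"\s+", " ", text).strip()


def rebuild_corpus(
    urls: Iterable[str],
    out_dir: str,
    fetch: Callable[[str], str] = fetch_url,
) -> int:
    """Fetch each URL and cache its text locally; return pages saved."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    saved = 0
    for i, url in enumerate(urls):
        url = url.strip()
        if not url:
            continue  # skip blank lines in the URL list
        try:
            page = fetch(url)
        except (OSError, ValueError):
            continue  # dead link or malformed URL: skip it
        (out / f"{i:06d}.txt").write_text(strip_tags(page), encoding="utf-8")
        saved += 1
    return saved
```

With a downloaded list saved as, say, urls.txt (a made-up file name),
calling rebuild_corpus(open("urls.txt"), "corpus_en") would cache one
text file per reachable page. Some pages will have changed or vanished
since the list was compiled, so the rebuilt corpus will not match the
original exactly.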

Not sure if this has been directly used in comparative evaluation of
language identification systems. Try asking Google research labs
http://labs.google.com/faq.html#contact

good luck

eric

Eric Atwell, 
Senior Lecturer, Language research group, School of Computing 
Faculty of Engineering, UNIVERSITY OF LEEDS, Leeds LS2 9JT, England
TEL: 0113-3435430  FAX: 0113-3435468  WWW/email: google Eric Atwell

On Thu, 19 Apr 2007, Dean Jones wrote:

> Hello all,
>
> I'd like to train a classifier to perform language identification,
> and, before I go ahead and create a corpus for this purpose, I'd like
> to ask whether anyone on this list knows of anything suitable. The
> main reason I'm asking is that I'm particularly interested in finding
> something which has been used in the comparative evaluation of
> language identification systems. Languages that we'd initially like to
> cover are English, French, Italian, German and Spanish. Thanks for any
> help,
>
> Best wishes,
>
> Dean.
>