[Corpora-List] Corpora for language identification training?
Eric Atwell
eric at comp.leeds.ac.uk
Thu Apr 19 09:24:50 UTC 2007
Hi Dean,
Serge Sharoff at Leeds has collected comparable 100-million-word
web-as-corpus corpora for several languages, see
http://corpus.leeds.ac.uk/internet.html
You can't download the text corpora directly: to avoid copyright
infringement, each web page may only be cached locally. But you CAN
download the list of URLs and then run a program to re-create the
corpora yourself.
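For illustration, here is a minimal Python sketch of what such a
re-creation program might look like. It is not the tool offered on the
Leeds page; it assumes the URL list is plain text with one URL per line,
and the names html_to_text, fetch and rebuild_corpus are hypothetical.
It only strips markup and collapses whitespace; a real pipeline would
also want boilerplate removal and de-duplication.

```python
import re
from html.parser import HTMLParser
from urllib.request import urlopen


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self._skip = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def html_to_text(html):
    """Strip markup and collapse runs of whitespace to single spaces."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return re.sub(r"\s+", " ", " ".join(parser.parts)).strip()


def fetch(url, timeout=10):
    """Download one page and decode it (assumes UTF-8 if undeclared)."""
    with urlopen(url, timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


def rebuild_corpus(urls, fetcher=fetch):
    """Map each reachable URL to its extracted text; skip failures."""
    corpus = {}
    for url in urls:
        try:
            corpus[url] = html_to_text(fetcher(url))
        except OSError:  # urllib's URLError/HTTPError are OSError subclasses
            continue
    return corpus
```

To use it, read the downloaded URL list into a Python list (one entry
per line) and pass it to rebuild_corpus. Pages that have disappeared
since the list was compiled are skipped silently, so the rebuilt
corpus will usually be somewhat smaller than the original.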
I'm not sure whether these corpora have been used directly in a
comparative evaluation of language identification systems. You could
try asking Google research labs:
http://labs.google.com/faq.html#contact
good luck
eric
Eric Atwell,
Senior Lecturer, Language research group, School of Computing
Faculty of Engineering, UNIVERSITY OF LEEDS, Leeds LS2 9JT, England
TEL: 0113-3435430 FAX: 0113-3435468 WWW/email: google Eric Atwell
On Thu, 19 Apr 2007, Dean Jones wrote:
> Hello all,
>
> I'd like to train a classifier to perform language identification,
> and, before I go ahead and create a corpus for this purpose, I'd like
> to ask whether anyone on this list knows of anything suitable. The
> main reason I'm asking is that I'm particularly interested in finding
> something which has been used in the comparative evaluation of
> language identification systems. Languages that we'd initially like to
> cover are English, French, Italian, German and Spanish. Thanks for any
> help,
>
> Best wishes,
>
> Dean.
>