[Corpora-List] Corpora for language identification training?

Mike Maxwell maxwell at umiacs.umd.edu
Thu Apr 19 12:07:43 UTC 2007


Dean Jones wrote:
 > I'd like to train a classifier to perform language identification,
 > and, before I go ahead and create a corpus for this purpose, I'd like
 > to ask whether anyone on this list knows of anything suitable.

I presume you're asking about spoken language ID, rather than ID of the
language of computer-readable texts or of images of printed or
handwritten text.

There have been a number of evaluations of spoken language ID by NIST. 
You might have a look at this:
   http://www.nist.gov/speech/tests/lang/2003/index.htm
I believe the data for all the evals was provided by the LDC, although a 
quick glance at the LDC catalog (http://www.ldc.upenn.edu/Catalog/) 
didn't show it.
-- 
	Mike Maxwell
	maxwell at umiacs.umd.edu


