[Corpora-List] Corpora for language identification training?
Mike Maxwell
maxwell at umiacs.umd.edu
Thu Apr 19 12:07:43 UTC 2007
Dean Jones wrote:
> I'd like to train a classifier to perform language identification,
> and, before I go ahead and create a corpus for this purpose, I'd like
> to ask whether anyone on this list knows of anything suitable.
I presume you're asking about spoken language ID, rather than language ID
for computer-readable text or for images of printed or handwritten text.
There have been a number of evaluations of spoken language ID by NIST.
You might have a look at this:
http://www.nist.gov/speech/tests/lang/2003/index.htm
I believe the data for all the evals was provided by the LDC, although a
quick glance at the LDC catalog (http://www.ldc.upenn.edu/Catalog/)
didn't show it.
--
Mike Maxwell
maxwell at umiacs.umd.edu