[Corpora-List] Within-language language ID

Fri Jan 14 15:21:59 UTC 2011

I'm wondering if anyone can point me to practical results on language
sub-classification, e.g. Spanish (Latin America vs. U.S. vs. Spain), French
(Canada vs. France vs. Belgium vs. ...), etc. What training set sizes are
needed for decent performance with standard character n-gram approaches?
Do those approaches, which work well for language ID in general, break down
badly once you're working within a single language? I'd be very happy to
receive practical comments, references to the literature, or both. I'm also
happy to take replies privately and then summarize to the list if there's
interest.

Thanks!

  Philip
_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora