I'm wondering if anyone can point me to practical results on language sub-classification, e.g. Spanish (Latin America vs. U.S. vs. Spain), French (Canada vs. France vs. Belgium vs. ...), etc. What training set sizes are needed for decent performance using standard character n-gram approaches? Do those approaches, which work well for language ID in general, break down badly once you're working within a single language?

I'd be very happy to receive practical comments, refs to the literature, or both. I'm also happy to take replies privately and then summarize to the list if there's interest.
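For concreteness, here's a rough sketch of the kind of approach I have in mind: a Cavnar & Trenkle-style n-gram profile classifier, only with one profile per variety instead of per language. The variety labels and the toy training strings are just placeholders to keep the sketch self-contained, not real data.

    # Minimal Cavnar & Trenkle-style character n-gram classifier,
    # applied to varieties of one language (sketch only; toy data).
    from collections import Counter

    def ngram_profile(text, n_max=4, top_k=300):
        """Rank the most frequent character 1..n_max-grams in the text."""
        counts = Counter()
        for n in range(1, n_max + 1):
            for i in range(len(text) - n + 1):
                counts[text[i:i + n]] += 1
        ranked = [gram for gram, _ in counts.most_common(top_k)]
        return {gram: rank for rank, gram in enumerate(ranked)}

    def out_of_place(doc_profile, variety_profile):
        """Sum of rank differences; unseen n-grams get a maximal penalty."""
        penalty = len(variety_profile)
        return sum(abs(rank - variety_profile.get(gram, penalty))
                   for gram, rank in doc_profile.items())

    def classify(text, variety_profiles):
        """Pick the variety whose profile is closest to the document's."""
        doc = ngram_profile(text)
        return min(variety_profiles,
                   key=lambda v: out_of_place(doc, variety_profiles[v]))

    # Placeholder training data -- in practice, one sizable corpus per variety.
    training = {
        "es-ES": "¿Vosotros habéis visto el ordenador y el móvil? Vale, genial.",
        "es-MX": "¿Ustedes ya vieron la computadora y el celular? Qué padre.",
    }
    profiles = {v: ngram_profile(t.lower()) for v, t in training.items()}
    print(classify("oye, ¿vosotros tenéis ordenador o no?".lower(), profiles))

With real per-variety corpora, my question is essentially how much training text per variety this kind of nearest-profile decision needs before it becomes reliable, given that the profiles for, say, Peninsular and Mexican Spanish will overlap far more than profiles for distinct languages do.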
Thanks!

Philip