<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40"><head><meta http-equiv=Content-Type content="text/html; charset=us-ascii"><meta name=Generator content="Microsoft Word 12 (filtered medium)"><style><!--
/* Font Definitions */
@font-face
{font-family:Wingdings;
panose-1:5 0 0 0 0 0 0 0 0 0;}
@font-face
{font-family:Calibri;
panose-1:2 15 5 2 2 2 4 3 2 4;}
@font-face
{font-family:Tahoma;
panose-1:2 11 6 4 3 5 4 4 2 4;}
@font-face
{font-family:Consolas;
panose-1:2 11 6 9 2 2 4 3 2 4;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
{margin:0cm;
margin-bottom:.0001pt;
font-size:11.0pt;
font-family:"Calibri","sans-serif";}
a:link, span.MsoHyperlink
{mso-style-priority:99;
color:blue;
text-decoration:underline;}
a:visited, span.MsoHyperlinkFollowed
{mso-style-priority:99;
color:purple;
text-decoration:underline;}
p.MsoPlainText, li.MsoPlainText, div.MsoPlainText
{mso-style-priority:99;
mso-style-link:"Plain Text Char";
margin:0cm;
margin-bottom:.0001pt;
font-size:10.5pt;
font-family:Consolas;}
p.MsoAcetate, li.MsoAcetate, div.MsoAcetate
{mso-style-priority:99;
mso-style-link:"Balloon Text Char";
margin:0cm;
margin-bottom:.0001pt;
font-size:8.0pt;
font-family:"Tahoma","sans-serif";}
span.EmailStyle17
{mso-style-type:personal-compose;
font-family:"Calibri","sans-serif";
color:windowtext;}
span.PlainTextChar
{mso-style-name:"Plain Text Char";
mso-style-priority:99;
mso-style-link:"Plain Text";
font-family:Consolas;}
span.BalloonTextChar
{mso-style-name:"Balloon Text Char";
mso-style-priority:99;
mso-style-link:"Balloon Text";
font-family:"Tahoma","sans-serif";}
.MsoChpDefault
{mso-style-type:export-only;}
@page WordSection1
{size:612.0pt 792.0pt;
margin:72.0pt 72.0pt 72.0pt 72.0pt;}
div.WordSection1
{page:WordSection1;}
--></style><!--[if gte mso 9]><xml>
<o:shapedefaults v:ext="edit" spidmax="1026" />
</xml><![endif]--><!--[if gte mso 9]><xml>
<o:shapelayout v:ext="edit">
<o:idmap v:ext="edit" data="1" />
</o:shapelayout></xml><![endif]--></head><body lang=EN-GB link=blue vlink=purple><div class=WordSection1><p class=MsoNormal>Hi Philip<o:p></o:p></p><p class=MsoNormal>I think we met at ACL Maryland 1999…<o:p></o:p></p><p class=MsoNormal>:-)<o:p></o:p></p><p class=MsoNormal><o:p> </o:p></p><p class=MsoNormal>If Constantin Orasan of Wolverhampton Uni agrees, I could send you an unpublished draft of a joint presentation we made at Complex 2001: “<i>Towards the Globalization of Business English?</i>” which tried (among other things) to distinguish British and American varieties in the WBE corpus (which contains webpages from Belgium, Hong Kong, Netherlands, Pakistan, Switzerland, UK, USA). We referred to Hofland and Johansson (1982), Leech and Fallon (1992), Mason and Berglund (2001), and Kilgarriff (2001).<o:p></o:p></p><p class=MsoNormal><o:p> </o:p></p><p class=MsoNormal>I think Eric Atwell (Leeds) has been working on this with several of his students over the past few years.<o:p></o:p></p><p class=MsoNormal><o:p> </o:p></p><p class=MsoNormal>Best<o:p></o:p></p><p class=MsoNormal>Ramesh<o:p></o:p></p><p class=MsoNormal><o:p> </o:p></p><p class=MsoNormal><span style='font-size:12.0pt;font-family:"Times New Roman","serif"'>Ramesh Krishnamurthy<br>Lecturer in English Studies, School of Languages and Social Sciences,<br>Aston University, Birmingham B4 7ET, UK<br>Tel: +44 (0)121-204-3812; Fax: +44 (0)121-204-3766<br>[Room NX08, 10th Floor, North Wing of Main Building]<br><a href="http://www1.aston.ac.uk/lss/staff/krishnamurthyr/"><span style='color:blue'>http://www1.aston.ac.uk/lss/staff/krishnamurthyr/</span></a><br>Director, ACORN (Aston Corpus Network project): <a href="http://acorn.aston.ac.uk/"><span style='color:blue'>http://acorn.aston.ac.uk/</span></a> 
<o:p></o:p></span></p><p class=MsoNormal><span style='font-size:12.0pt;font-family:"Times New Roman","serif"'><o:p> </o:p></span></p><p class=MsoPlainText>Message: 2<o:p></o:p></p><p class=MsoPlainText>Date: Fri, 14 Jan 2011 10:21:59 -0500<o:p></o:p></p><p class=MsoPlainText>From: P Resnik &lt;<a href="mailto:psresnik@gmail.com">psresnik@gmail.com</a>&gt;<o:p></o:p></p><p class=MsoPlainText>Subject: [Corpora-List] Within-language language ID<o:p></o:p></p><p class=MsoPlainText>To: CORPORA &lt;<a href="mailto:CORPORA@uib.no">CORPORA@uib.no</a>&gt;<o:p></o:p></p><p class=MsoPlainText><o:p> </o:p></p><p class=MsoPlainText>I'm wondering if anyone can point me to practical results on language sub-classification, e.g. Spanish (Latin America vs. U.S. vs. Spain), French (Canada vs. France vs. Belgium vs. ...), etc. What training set sizes are needed for decent performance using standard character n-gram sorts of approaches? Do those approaches, which work well for language ID in general, break down badly once you're working within a single language?<o:p></o:p></p><p class=MsoPlainText>I'd be very happy to receive practical comments, refs to the literature, or both. I'm also happy to take replies privately and then summarize to the list if there's interest.<o:p></o:p></p><p class=MsoPlainText><o:p> </o:p></p><p class=MsoPlainText>Thanks!<o:p></o:p></p><p class=MsoPlainText><o:p> </o:p></p><p class=MsoPlainText> Philip<o:p></o:p></p><p class=MsoNormal><o:p> </o:p></p></div></body></html>