[Corpora-List] Within-language language ID

Sat Jan 15 13:56:33 UTC 2011

Hi Philip
I think we met at ACL Maryland 1999...
:)

If Constantin Orasan of Wolverhampton Uni agrees, I could send you an unpublished
draft of a joint presentation we gave at Complex 2001, "Towards the Globalization of Business English?",
which tried (among other things) to distinguish British and American varieties in the WBE corpus
(which contains webpages from Belgium, Hong Kong, the Netherlands, Pakistan, Switzerland, the UK and the USA).
We referred to Hofland and Johansson (1982), Leech and Fallon (1992), Mason and Berglund (2001),
and Kilgarriff (2001).

I think Eric Atwell (Leeds) has been working on this with several of his students over the past few years.

Best
Ramesh

Ramesh Krishnamurthy
Lecturer in English Studies, School of Languages and Social Sciences,
Aston University, Birmingham B4 7ET, UK
Tel: +44 (0)121-204-3812 ; Fax: +44 (0)121-204-3766 [Room NX08, 10th
Floor, North Wing of Main Building]
http://www1.aston.ac.uk/lss/staff/krishnamurthyr/
Director, ACORN (Aston Corpus Network project): http://acorn.aston.ac.uk/

Message: 2
Date: Fri, 14 Jan 2011 10:21:59 -0500
From: P Resnik <psresnik at gmail.com>
Subject: [Corpora-List] Within-language language ID
To: CORPORA <CORPORA at uib.no>

I'm wondering if anyone can point me to practical results on language sub-classification, e.g. Spanish (Latin America vs. U.S. vs. Spain), French (Canada vs. France vs. Belgium vs. ...), etc. What training set sizes are needed for decent performance using standard character n-gram sorts of approaches? Do those approaches, which work well for language ID in general, break down badly once you're working within a single language?

I'd be very happy to receive practical comments, refs to the literature, or both.  I'm also happy to take replies privately and then summarize to the list if there's interest.

Thanks!

  Philip

_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora