[Corpora-List] Corpora for language identification training?

Vlado Keselj vlado at cs.dal.ca
Thu Apr 19 14:23:59 UTC 2007


Hi,

You can find several links relevant to written language identification at:
http://users.cs.dal.ca/~vlado/nlp/#nlp/tc/langid

Here is the URL list as well:

cat:nlp/tc/langid
name:Language identification tools, by Gertjan van Noord (TextCat)
URL:http://odur.let.rug.nl/~vannoord/TextCat/competitors.html

cat:nlp/tc/langid
name:On-line tool by Steve Huffman
URL:http://complingone.georgetown.edu/~langid/

cat:nlp/tc/langid
URL:http://cslu.cse.ogi.edu/HLTsurvey/ch8node9.html
name:Chapter on Automatic Language Identification
description: in Survey of the State of the Art in Human Language Technology
    (http://cslu.cse.ogi.edu/HLTsurvey/HLTsurvey.html), by several editors

cat:nlp/tc/langid
URL:http://www.faganfinder.com/translate/identify.php
name:Language identification tool at Fagan Finder

cat:nlp/tc/langid
URL:http://www.translation-guide.com/language_identification.htm
name:Language identification tool at translation-guide.com

cat:nlp/tc/langid
URL:http://www.xrce.xerox.com/people/beesley/langid.html
name:Language identifier by Ken Beesley

cat:nlp/tc/langid
URL:http://dis.tpd.tno.nl/druid/lid/lid_index.html
name:DRUID, a language identification tool

cat:nlp/tc/langid
URL:http://www.w3.org/TR/2004/REC-xml-20040204/#sec-lang-tag
name:Specifying the language of content in XML (xml:lang)

cat:nlp/tc/langid
URL:http://www-rali.iro.umontreal.ca/ProjetSILC.en.html
name:SILC project at RALI

cat:nlp/tc/langid
URL:http://veristage.com/demo/test3.php
name:Language Identification tool
description: by Veristage; minimum 40 characters

cat:nlp/tc/langid
URL:http://www.sil.org/silewp/2000/001/SILEWP2000-001.html
name:Language identification and IT: Addressing problems of linguistic
     diversity on a global scale
description: by Peter Constable and Gary Simons, SIL International;
     about language tagging

cat:nlp/tc/langid
URL:http://www.usdoj.gov/crt/cor/Pubs/ISpeakCards.pdf
name:Language identification flashcard
description: by US Dept. of Commerce

cat:nlp/tc/langid
URL:http://www.research.microsoft.com/~joshuago/physicslongcomment.ps
name:Comment by J. Goodman on a physics paper, "Language Trees and
     Zipping", which got a lot of press coverage in 2001

cat:nlp/tc/langid
URL:http://www.unhchr.ch/udhr/navigate/alpha.htm
name:Universal Declaration of Human Rights
description: UN, in 363 languages (17 Jun 2004)
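
The entries above follow a simple record layout: blank lines separate
records, each field is key:value, and indented lines continue the previous
field. In case anyone wants to process the list, here is a minimal Python
sketch of a reader for that layout. The names parse_records and SAMPLE are
only illustrative and not part of any existing tool, and the layout rules
are inferred from the entries themselves.

```python
def parse_records(text):
    """Parse blank-line-separated 'key:value' records.

    Assumed layout (inferred from the list, not a documented format):
    - a blank line ends the current record;
    - a line starting with whitespace continues the previous field;
    - any other line is split at its first ':' into key and value.
    """
    records = []
    current = {}
    last_key = None
    for line in text.splitlines():
        if not line.strip():
            # Blank line: close the record in progress, if any.
            if current:
                records.append(current)
            current, last_key = {}, None
        elif line[0].isspace() and last_key is not None:
            # Continuation line: append to the previous field's value.
            current[last_key] += " " + line.strip()
        else:
            key, sep, value = line.partition(":")
            if not sep:
                continue  # no key on this line; skip it
            last_key = key.strip()
            # A repeated key within one record overwrites the earlier value.
            current[last_key] = value.strip()
    if current:
        records.append(current)
    return records


# Two entries copied from the list, used as sample input.
SAMPLE = """\
cat:nlp/tc/langid
URL:http://veristage.com/demo/test3.php
name:Language Identification tool
description: by Veristage; minimum 40 characters

cat:nlp/tc/langid
URL:http://www.sil.org/silewp/2000/001/SILEWP2000-001.html
name:Language identification and IT: Addressing problems of linguistic
     diversity on a global scale
"""

if __name__ == "__main__":
    for record in parse_records(SAMPLE):
        print(record["name"], "->", record["URL"])
```

Splitting only at the first colon keeps URLs and names that contain colons
intact.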


--Vlado

On Thu, 19 Apr 2007, Adam Funk wrote:

> [19/04/07 13:35] Dean Jones wrote:
> 
> > Sorry, I wasn't clear. Personally I'm interested in language ID for
> > "written" texts - specifically, email, although others on the list may
> > be interested in spoken language ID, so I wouldn't want to discourage
> > responses about that.
> 
> Here's a tool you might be interested in:
> 
> http://www.let.rug.nl/~vannoord/TextCat/
> 
> 
> along with a list of others:
> 
> http://www.let.rug.nl/~vannoord/TextCat/competitors.html
> 


