Mystery Language

James M Tonn jtonn at PRINCETON.EDU
Fri Jul 14 14:58:37 UTC 2006


This system has no knowledge of syntax or lexicon. It looks like it works by breaking text down into a list of short character clusters and comparing them with existing lists of clusters that have been found to occur frequently in particular languages. When you enter text, the system searches for the language in which your character clusters have the best overall frequency. If you enter just "rz", it looks for the language where that string is most frequent, which, not surprisingly, turns out to be Polish. "th" is apparently a frequent cluster in Vietnamese (more frequent than in English), but when you add an "e" the overall frequency is better in English. (It also looks at the frequencies of the individual letters, but longer strings have a bigger effect on the final determination.)
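To make the idea concrete, here is a rough Python sketch of cluster-frequency guessing. It is only my guess at the general approach, not the actual code behind the Language Guesser: the two sample sentences, the function names, and the choice to weight each cluster by its length (so that longer strings count for more) are all assumptions made up for illustration, and a real system would be trained on far more text.

```python
from collections import Counter

def clusters(text, max_n=3):
    """Return every character cluster of length 1..max_n found in text."""
    text = text.lower()
    return [text[i:i + n]
            for n in range(1, max_n + 1)
            for i in range(len(text) - n + 1)]

def build_profile(sample, max_n=3):
    """Map each cluster in a training sample to its relative frequency."""
    counts = Counter(clusters(sample, max_n))
    total = sum(counts.values())
    return {c: k / total for c, k in counts.items()}

def score(text, profile, max_n=3):
    """Sum the profile frequencies of the text's clusters.

    Weighting by cluster length is an assumption, chosen so that longer
    strings have a bigger effect than single letters.
    """
    return sum(len(c) * profile.get(c, 0.0) for c in clusters(text, max_n))

def guess(text, profiles):
    """Return the language whose profile gives the text the best score."""
    return max(profiles, key=lambda lang: score(text, profiles[lang]))

# Toy training samples (hypothetical, far too small for real use).
PROFILES = {
    "english": build_profile("the weather there is the thing they think about"),
    "polish": build_profile("przez morze przy brzegu dobrze rzadko przechodzi"),
}

print(guess("rz", PROFILES))   # polish
print(guess("the", PROFILES))  # english
```

With only two tiny profiles the guesser always has to pick one of them, however poor the match, which is the same behaviour that produces answers like "Manx" for short English phrases.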

So the system will work better the more text you provide. And if our "mystery language" was written down based on how a foreigner thought it sounded (even if it is underlyingly a real language), that's pretty much the worst kind of data you can feed to this system. It has been "trained" on clusters pulled from Internet message postings, and will only work if the entered text uses the same alphabet or transliteration as that training material.

Jim Tonn

----- Original Message -----
From: Josh Wilson <jwilson at ALINGA.COM>
Date: Friday, July 14, 2006 3:37 am
Subject: Re: [SEELANGS] Mystery Language
To: SEELANGS at LISTSERV.CUNY.EDU

> Some other amusing results:
> 
> "My name is Bob" - Middle Frisian
> "Do you speak English? Yes, I speak English." - Unknown Language
> 
> Also, almost any entry involving obscenity is "guessed" as being 
> Scots!! 
> 
> I would be interested to know what sort of criteria this program 
> is using.  
> 
> 
> -----Original Message-----
> From: Slavic & East European Languages and Literature list
> [mailto:SEELANGS at listserv.cuny.edu] On Behalf Of Alina Israeli
> Sent: Thursday, July 13, 2006 4:50 PM
> To: SEELANGS at listserv.cuny.edu
> Subject: Re: [SEELANGS] Mystery Language
> 
> >There is one minor problem with this page and the results. I went to
> >the page and pasted in the English 'Over my dead body' just for fun.
> >The result was 'Manx'. So it seems that Language Guesser doesn't
> >guess very well.
> 
> It needs some work, obviously. Here are my tries and results:
> 
> ja govorju - Serbian
> 
> ia govoriu - Croatian
> 
> over my dead body - Manx
> 
> over my - English
> 
> dead body - Welsh
> 
> __________________________
> Alina Israeli
> LFS, American University
> 4400 Mass. Ave., NW
> Washington, DC 20016
> 
> phone:    (202) 885-2387
> fax:      (202) 885-1076 
> 
> -------------------------------------------------------------------------
> Use your web browser to search the archives, control your subscription
> options, and more.  Visit and bookmark the SEELANGS Web Interface at:
>                    http://seelangs.home.comcast.net/
> -------------------------------------------------------------------------
> 



