[Corpora-List] charset identifier

Mike Maxwell maxwell at umiacs.umd.edu
Sun Apr 17 12:56:06 UTC 2011


On 4/16/2011 6:51 AM, Joerg Tiedemann wrote:
> Can someone point me to reliable (freely available) tools for
> character set identification?
> I would like to have a rather universal tool that can give me the used
> char encoding for a given text and given the expected language of that
> text.

You might have a look at Kevin Scannell's site:
    http://borel.slu.edu/crubadan/stadas.html
It is more about language ID than character set identification.  I'm
not sure how he handles character encodings, although I suppose one
could create a separate cluster for each encoding of a language that
is written in several.  We did something like that some years back in
the TIDES Surprise Language exercise for Hindi, where the web had
multiple proprietary encodings and very little Unicode-encoded text.
Perhaps the situation has improved since then!
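
To make the "one cluster per encoding" idea concrete, here is a minimal
sketch in Python.  It is not what Crubadan or the TIDES work actually
did; it only illustrates treating each (language, encoding) pair as its
own class and comparing byte bigram counts.  All function names are
hypothetical, and the Russian sample with koi8_r, cp1251 and utf-8 is
only a stand-in, since Python ships no codecs for the proprietary Hindi
encodings.

```python
import math
from collections import Counter


def byte_ngrams(data: bytes, n: int = 2) -> Counter:
    """Count overlapping byte n-grams in a byte string."""
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))


def train_profiles(samples, encodings, n=2):
    """Build one byte n-gram profile per (language, encoding) pair.

    samples:   dict mapping a language label to Unicode training text
    encodings: iterable of Python codec names to try for every language
    """
    profiles = {}
    for lang, text in samples.items():
        for enc in encodings:
            try:
                data = text.encode(enc)
            except UnicodeEncodeError:
                continue  # this codec cannot represent the training text
            profiles[(lang, enc)] = byte_ngrams(data, n)
    return profiles


def score(data: bytes, profile: Counter, n: int = 2) -> float:
    """Mean log-probability of the data's n-grams, with add-one smoothing."""
    grams = byte_ngrams(data, n)
    count = sum(grams.values())
    if count == 0:
        return float("-inf")
    total = sum(profile.values())
    vocab = len(profile) + 1  # one extra slot for unseen n-grams
    logp = sum(c * math.log((profile[g] + 1) / (total + vocab))
               for g, c in grams.items())
    return logp / count


def identify(data: bytes, profiles, language=None, n=2):
    """Return the best (language, encoding) key for the raw bytes.

    If the expected language is known, only its profiles are compared.
    """
    candidates = {key: prof for key, prof in profiles.items()
                  if language is None or key[0] == language}
    return max(candidates, key=lambda key: score(data, candidates[key], n))


if __name__ == "__main__":
    # Toy training data; a real system would need far more text per language.
    samples = {"ru": "съешь же ещё этих мягких французских булок, да выпей чаю"}
    profiles = train_profiles(samples, ["koi8_r", "cp1251", "utf-8"])
    unknown = "выпей ещё чаю".encode("cp1251")
    print(identify(unknown, profiles, language="ru"))
```

Passing language= restricts the comparison to the encodings of the
expected language, which is the case Joerg describes.  With realistic
amounts of training text one would probably want longer n-grams and a
better smoothing scheme than add-one.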
-- 
	Mike Maxwell
	maxwell at umiacs.umd.edu
	"My definition of an interesting universe is
	one that has the capacity to study itself."
         --Stephen Eastmond