[Corpora-List] charset identifier

Sat Apr 16 22:13:02 UTC 2011

Joerg Tiedemann <jorg.tiedemann <at> lingfil.uu.se> writes:

> 
> Can someone point me to reliable (freely available) tools for
> character set identification?
> I would like to have a rather universal tool that can give me the used
> char encoding for a given text and given the expected language of that
> text.
> (Possibly with confidence values if available.)
> 
> I know about these tools (but I would also appreciate any comments
> about their quality):

I can't say anything about their quality, sorry. Still, I'd follow the approach in
"A Composite Approach to Language/Encoding Detection"
[http://www.unicode.org/iuc/iuc19/a322.html], which is implemented in
Mozilla's Universal Charset Detector
[http://www.mozilla.org/projects/intl/UniversalCharsetDetection.html,
http://www.mozilla.org/projects/intl/detectorsrc.html] (the latter has some
info on how to build a standalone detector from their code).

Then, here are some links to projects using the idea:
[http://fredeaker.blogspot.com/2007/01/character-encoding-detection.html]. In
particular, this one [http://chardet.feedparser.org/] gives confidence values
(and has a fairly recent release, dated 2009-11-10).
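
In case it helps, here is a rough sketch of how the chardet route might look in Python. It assumes the third-party chardet package is installed and that chardet.detect() takes a byte string and returns a dict with "encoding" and "confidence" keys. The names guess_encoding, min_confidence and the detect parameter are made up for illustration and are not part of chardet.

```python
from typing import Callable, Optional, Tuple


def guess_encoding(
    data: bytes,
    min_confidence: float = 0.5,
    detect: Optional[Callable[[bytes], dict]] = None,
) -> Tuple[Optional[str], float]:
    """Return (encoding, confidence), or (None, confidence) if the guess is too weak.

    `detect` is a hypothetical hook so another detector can be swapped in;
    by default the third-party chardet package is used (assumed installed).
    """
    if detect is None:
        import chardet  # third-party, imported lazily: pip install chardet
        detect = chardet.detect

    result = detect(data)
    encoding = result.get("encoding")
    confidence = result.get("confidence") or 0.0

    # Treat low-confidence guesses as "unknown" instead of trusting them.
    if encoding is None or confidence < min_confidence:
        return None, confidence
    return encoding, confidence
```

To use it, read the file in binary mode (open(path, "rb").read()) and pass the bytes in. The 0.5 threshold is an arbitrary choice, not a recommended value; if you know the expected language, you could also reject guesses whose encoding is implausible for it.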

good luck!
