[Corpora-List] charset identifier

Joerg Tiedemann jorg.tiedemann at lingfil.uu.se
Sat Apr 16 10:51:21 UTC 2011


Can someone point me to reliable, freely available tools for
character set identification?
I would like a fairly universal tool that, given a text and the
expected language of that text, tells me which character encoding
is used.
(Possibly with confidence values, if available.)
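
To make the request concrete, here is a minimal sketch of the kind of
interface meant above: bytes and an expected language go in, a ranked
list of (encoding, score) pairs comes out. It is not taken from any of
the tools listed in this message. The function name guess_encoding, the
alphabet table and the candidate list are hypothetical, and the score is
only a crude alphabet-coverage ratio, not a calibrated confidence. Such a
ratio cannot separate encodings that map the same bytes to letters of the
same script (for example cp1251 and KOI8-R for Russian), which is one
reason real detectors rely on per-language statistics.

```python
# Illustrative sketch only: the alphabet table and the candidate list
# are assumptions, not part of any tool mentioned in this message.
EXPECTED_LETTERS = {
    "sv": set("abcdefghijklmnopqrstuvwxyzåäö"),
}
CANDIDATES = ["utf-8", "iso-8859-1", "cp1251", "koi8-r"]


def guess_encoding(data, lang, candidates=CANDIDATES):
    """Return (encoding, score) pairs, best first; score is in [0, 1]."""
    expected = EXPECTED_LETTERS[lang]
    ranked = []
    for enc in candidates:
        try:
            text = data.decode(enc)
        except UnicodeDecodeError:
            continue  # the bytes are not valid in this encoding
        # Score: share of non-whitespace characters that belong to the
        # alphabet expected for the language.
        chars = [c for c in text.lower() if not c.isspace()]
        score = sum(c in expected for c in chars) / len(chars) if chars else 0.0
        ranked.append((enc, score))
    # list.sort is stable, so ties keep the order of the candidate list.
    ranked.sort(key=lambda pair: pair[1], reverse=True)
    return ranked


if __name__ == "__main__":
    sample = "Jörg säger hej på svenska".encode("iso-8859-1")
    for enc, score in guess_encoding(sample, "sv"):
        print(f"{enc}: {score:.2f}")
```

Candidates whose decoding fails outright are dropped, and the remaining
ones are ordered by how much of the decoded text looks like the expected
language.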

I know about these tools (but I would also appreciate any comments
about their quality):

enca:  http://gitorious.org/enca
This is exactly what I need, but it supports only a few languages.
Maybe someone knows how to extend it with more languages/encodings?

utrac:  http://utrac.sourceforge.net/
I haven't tested it but it seems to be quite restricted as well.

cpdetector:  http://cpdetector.sourceforge.net/

convert_encoding.py:  https://github.com/goerz/convert_encoding.py
This includes a "guess encoding" option.
These tools do not seem to be freely available:
http://www.lingua-systems.com/language-identifier/lidc-application/
http://www.lingua-systems.com/unicode-converter/autouniconv-library/

The standard Unix tool 'file' is of course sometimes helpful as well,
but it is too restricted.

Is there anything else (that I can use without training specific models myself)?

Thanks!
Jörg



-- 
**********************************************************************************
 Jörg Tiedemann                                     jorg.tiedemann at lingfil.uu.se
 Dep. of Linguistics and Philology
http://stp.lingfil.uu.se/~joerg/
 Uppsala University                                  tel:  +46 (0)18 - 471 1412
 Box 635, SE-751 26 Uppsala/SWEDEN   fax: +46 (0)18 - 471 1094



