[Corpora-List] charset identifier
Joerg Tiedemann
jorg.tiedemann at lingfil.uu.se
Sat Apr 16 10:51:21 UTC 2011
Can someone point me to reliable (freely available) tools for
character set identification?
I would like a fairly universal tool that reports the character
encoding of a given text, given the expected language of that
text.
(Possibly with confidence values if available.)
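To make the request concrete, here is a minimal Python sketch of the kind of interface I have in mind. It is only a naive trial-decoding heuristic, not how any of the tools below work: the function name guess_encoding, the candidate list, and the SWEDISH character set are all made up for illustration, and the score is a crude stand-in for a real confidence value.

```python
def guess_encoding(data, candidates, expected_chars):
    """Rank candidate encodings by how well the decoded text fits
    the characters expected for the language (hypothetical helper)."""
    expected = set(expected_chars)
    results = []
    for enc in candidates:
        try:
            text = data.decode(enc)
        except UnicodeDecodeError:
            continue  # the bytes are not valid in this encoding
        chars = [c for c in text if not c.isspace()]
        if not chars:
            continue
        # Fraction of non-space characters that belong to the expected set.
        score = sum(c in expected for c in chars) / len(chars)
        results.append((enc, score))
    # Best-fitting encoding first.
    results.sort(key=lambda r: r[1], reverse=True)
    return results


# Assumed character inventory for Swedish, for illustration only.
SWEDISH = "abcdefghijklmnopqrstuvwxyzåäöABCDEFGHIJKLMNOPQRSTUVWXYZÅÄÖ.,!?"

sample = "Jörg går över ån".encode("iso-8859-1")
ranking = guess_encoding(
    sample, ["utf-8", "iso-8859-1", "cp1251", "cp437"], SWEDISH
)
for enc, score in ranking:
    print(enc, round(score, 2))
```

A heuristic like this cannot separate encodings that agree on the bytes actually present (iso-8859-1 and cp1252 would tie on this sample), which is one reason I am looking for an existing, better-tested tool.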
I know about these tools (but I would also appreciate any comments
about their quality):
enca: http://gitorious.org/enca
This is exactly what I need, but it does not support many languages.
Maybe someone knows how to extend it with more languages/encodings?
utrac: http://utrac.sourceforge.net/
I haven't tested it but it seems to be quite restricted as well.
cpdetector: http://cpdetector.sourceforge.net/
convert_encoding.py: https://github.com/goerz/convert_encoding.py
It includes a "guess encoding" option.
These tools do not seem to be freely available:
http://www.lingua-systems.com/language-identifier/lidc-application/
http://www.lingua-systems.com/unicode-converter/autouniconv-library/
The standard Unix tool 'file' is of course also sometimes helpful, but
it is too restricted.
Is there anything else (that I can use without training specific models myself)?
Thanks!
Jörg
--
**********************************************************************************
Jörg Tiedemann jorg.tiedemann at lingfil.uu.se
Dep. of Linguistics and Philology
http://stp.lingfil.uu.se/~joerg/
Uppsala University tel: +46 (0)18 - 471 1412
Box 635, SE-751 26 Uppsala/SWEDEN fax: +46 (0)18 - 471 1094