[Corpora-List] determining the correct character encoding

David Evans devans at cs.columbia.edu
Mon Oct 10 13:35:23 UTC 2005


I've had fairly good success with jchardet, the Java port of Mozilla's 
chardet code: http://jchardet.sourceforge.net/
See http://www.mozilla.org/projects/intl/chardet.html for the original 
C++ source code.
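If you only need to distinguish the two encodings mentioned below (ISO-8859-1 vs. UTF-8), there is also a much cruder trick that needs no external library at all: every byte sequence is valid ISO-8859-1, but multi-byte UTF-8 sequences have a strict structure, so if a strict UTF-8 decode succeeds, UTF-8 is the safer guess (for pure ASCII the two are identical anyway). A minimal sketch using only java.nio (class and method names here are my own, not from jchardet):

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

public class CharsetGuess {

    // Guess between UTF-8 and ISO-8859-1 only: try a strict UTF-8
    // decode (malformed input raises an exception instead of being
    // replaced); if it fails, fall back to ISO-8859-1, in which every
    // byte sequence is valid.
    public static String guessCharset(byte[] bytes) {
        try {
            Charset.forName("UTF-8")
                   .newDecoder()
                   .onMalformedInput(CodingErrorAction.REPORT)
                   .onUnmappableCharacter(CodingErrorAction.REPORT)
                   .decode(ByteBuffer.wrap(bytes));
            return "UTF-8";
        } catch (CharacterCodingException e) {
            return "ISO-8859-1";
        }
    }

    public static void main(String[] args) throws Exception {
        // 0xC3 0xA4 for the umlaut decodes cleanly as UTF-8 ...
        System.out.println(guessCharset("Universität".getBytes("UTF-8")));
        // ... while the single Latin-1 byte 0xE4 followed by a plain
        // letter is malformed UTF-8, so this falls back to ISO-8859-1.
        System.out.println(guessCharset("Universität".getBytes("ISO-8859-1")));
    }
}
```

This is obviously far weaker than chardet's statistical approach (it cannot tell ISO-8859-1 from ISO-8859-15, for instance), but as a tie-breaker between Latin-1 and UTF-8 it works surprisingly well.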
You could also look at TextCat (a Perl implementation of an n-gram-based 
language guesser - you could train it for encodings, but it probably 
isn't nearly as effective as the code above for charset detection): 
http://odur.let.rug.nl/~vannoord/TextCat/
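As for the conversion step asked about below: InputStreamReader plus OutputStreamWriter is indeed all you need once the source charset is known (from a detector, an HTTP header, or a meta tag). A minimal sketch, with class and method names of my own choosing:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;

public class ToUtf8 {

    // Re-encode bytes from a known (or detected) source charset to
    // UTF-8: InputStreamReader decodes bytes to chars using the source
    // charset, OutputStreamWriter encodes the chars back out as UTF-8.
    public static byte[] toUtf8(byte[] input, String sourceCharset)
            throws Exception {
        Reader reader = new InputStreamReader(
                new ByteArrayInputStream(input), sourceCharset);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        Writer writer = new OutputStreamWriter(out, "UTF-8");
        char[] buf = new char[4096];
        int n;
        while ((n = reader.read(buf)) != -1) {
            writer.write(buf, 0, n);
        }
        writer.close();
        return out.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        byte[] latin1 = "Universität des Saarlandes".getBytes("ISO-8859-1");
        byte[] utf8 = toUtf8(latin1, "ISO-8859-1");
        System.out.println(new String(utf8, "UTF-8"));
    }
}
```

For real web pages you would of course stream straight from the URL connection rather than from a byte array, but the reader/writer wiring is the same.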

Hope that helps,

dave

Alexander Schutz wrote:

> Dear List,
>
> I was wondering whether there exists a Java class that deals
> adequately with determining the correct character encoding of a
> given text.
> Formerly I used the shell tool "file" via a Perl system call to
> identify the source encoding, which was then the input for "iconv",
> but ever since I switched to Java, character encodings have really
> been bugging me. For instance, when I extract the body text of
> websites from the web, their character encodings may differ
> (mainly between ISO-8859-1 and UTF-8). Internally, however, I'd like
> to deal with UTF-8 only, so I need a convenient way to transform from
> ISO-8859-1 to UTF-8. The InputStreamReader class provides the means
> for that undertaking, but I still need to specify the original
> charset. For one, I could try to get that information from the HTML
> source code, but it is not always specified there. So, in Java terms,
> is there a way to determine which charset a text uses by looking at
> the text alone?
> Has anybody encountered this kind of problem before? (anyone? maybe
> the web-as-corpus guys?)
> Anyway, your help would be very much appreciated,
> thanks a million in advance,
> Alex
> -- 
> Alexander Schutz
> Student of Computational Linguistics
> University of Saarland, Germany 


