[Corpora-List] determining the correct character encoding

Alexander Schutz goalscoringsuperstarhero at gmail.com
Mon Oct 10 12:08:12 UTC 2005


Dear List,

I was wondering whether there exists a Java class that deals
adequately with determining the correct character encoding for a
given text.
Previously I called the shell tool "file" from Perl via a system call to
identify the source encoding, which I then passed to "iconv", but
ever since I switched to Java, character encodings have really been
bugging me. For instance, when I extract the body text of websites,
their character encoding may differ
(mainly between ISO-8859-1 and UTF-8). However, internally, I'd like
to deal with UTF-8 only, so I need a convenient way to transform from
ISO-8859-1 to UTF-8. The InputStreamReader class provides the means
for that, but I still need to specify the original charset. For one
thing,
I could try to get the information from the HTML source code, but it
is not always specified there. So, in Java terms, is there a way to
tell which charset a text uses by looking at the text alone?
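
For the two-encoding case described above, one common heuristic can be sketched as follows. It is not a general detector: it relies on the observation that ISO-8859-1 text containing non-ASCII characters is rarely well-formed UTF-8, so a strict UTF-8 decode serves as the test and ISO-8859-1 as the fallback. The class and method names are hypothetical, and StandardCharsets assumes Java 7 or later.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: guesses between UTF-8 and ISO-8859-1 only.
final class CharsetGuess {

    // Returns true if the bytes form a well-formed UTF-8 sequence.
    static boolean isValidUtf8(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)       // fail instead of substituting U+FFFD
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Decodes as UTF-8 when the bytes are valid UTF-8, otherwise as ISO-8859-1.
    // ISO-8859-1 maps every byte to a character, so the fallback never fails.
    static String decodeGuessing(byte[] bytes) {
        return isValidUtf8(bytes)
                ? new String(bytes, StandardCharsets.UTF_8)
                : new String(bytes, StandardCharsets.ISO_8859_1);
    }

    // Normalises the input to UTF-8 bytes, whichever of the two encodings it used.
    static byte[] toUtf8(byte[] bytes) {
        return decodeGuessing(bytes).getBytes(StandardCharsets.UTF_8);
    }
}
```

Pure ASCII input is valid under both encodings and decodes identically, so the guess only matters once non-ASCII bytes appear. Text in other encodings (for example windows-1252 or ISO-8859-15) would be silently treated as ISO-8859-1 by this sketch.
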
Has anybody encountered this kind of problem before? (anyone? maybe
the web-as-corpus guys?)
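
As for reading the charset from the HTML source when it is declared, a rough sketch might look like the following. The class name, the regular expression, and the 4096-byte scan limit are all assumptions made for illustration. The sketch only covers ASCII-compatible encodings and ignores the HTTP Content-Type header, which may also carry a charset.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: looks for a charset declared in an HTML <meta> tag.
final class DeclaredCharset {

    // Matches both <meta charset="..."> and
    // <meta http-equiv="Content-Type" content="text/html; charset=...">.
    private static final Pattern CHARSET = Pattern.compile(
            "<meta[^>]*charset\\s*=\\s*[\"']?([A-Za-z0-9_.:-]+)",
            Pattern.CASE_INSENSITIVE);

    // Returns the declared charset name, or null if none is found
    // in the first 4096 bytes (an arbitrary limit chosen for this sketch).
    static String find(byte[] page) {
        int n = Math.min(page.length, 4096);
        // Decode as ISO-8859-1 just to scan the markup: every byte maps to a
        // character, and the ASCII tag text is preserved unchanged.
        String head = new String(page, 0, n, StandardCharsets.ISO_8859_1);
        Matcher m = CHARSET.matcher(head);
        return m.find() ? m.group(1) : null;
    }
}
```

When find returns null, the page declares nothing in that window, which is exactly the case where guessing from the bytes themselves becomes necessary.
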
Anyways, your help would be very much appreciated,
thanks a million in advance,
Alex
--
Alexander Schutz
Student of Computational Linguistics
University of Saarland, Germany