[Corpora-List] Character encoding headaches
Mike Maxwell
maxwell at umiacs.umd.edu
Sun Aug 2 22:59:47 UTC 2009
Josep M. Fontana wrote:
> I'm sure many people in the list have experienced the curse of character
> encoding when building corpora
You don't know how bad it can get until you have to deal with Indic
encodings :-(
> These documents are giving me a lot of headaches. What I did was to open
> them with OpenOffice Writer (as I'm working with Linux) and save them as
> text. Running the 'file' command in Linux, I see that the format of the
> text file is "UTF-8 Unicode (with BOM) text".
>
> Since the tools I'm using (Freeling from
> http://garraf.epsevg.upc.es/freeling/) only work well when the text is
> ISO-8859-1, I ran 'iconv' (iconv -f UTF8 -t ISO8859-1 < ...etc.) but I
> get the following error message: "illegal input sequence at position 0".
The BOM will be at position 0. If there's a way to save the files
without the BOM (which is AFAIK irrelevant for UTF-8 anyway), that will
probably solve this problem. Alternatively, you could write a snippet
of code that would cut out the BOM; in UTF-8 it is the first three bytes
of the file (EF BB BF in hex). You can check with a hex editor.
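As a rough illustration of the "snippet of code" idea, here is a minimal Python sketch that drops a leading UTF-8 BOM and then re-encodes the text as ISO-8859-1. The function names are made up for this example, and it assumes the whole file fits in memory as a bytes object.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM (bytes EF BB BF), if present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


def utf8_to_latin1(data: bytes) -> bytes:
    """Decode UTF-8 input (with or without BOM) and re-encode as ISO-8859-1.

    Raises UnicodeEncodeError if the text contains a character that
    ISO-8859-1 cannot represent.
    """
    text = strip_utf8_bom(data).decode("utf-8")
    return text.encode("iso-8859-1")


if __name__ == "__main__":
    sample = codecs.BOM_UTF8 + "café".encode("utf-8")
    print(utf8_to_latin1(sample))
```

If the text contains anything outside ISO-8859-1 (curly quotes, for instance), the encode step will still fail, so removing the BOM may only be the first hurdle.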
> After this, I tried a different path. I have access to some Windows
> machines so I saved as text directly from within Word choosing the
> Windows text format. Supposedly, this is Windows-1252 encoding and I
> assumed this option was the safest since this is the default text format
> for Word (at least with the installation I used). Then I ran iconv -f
> WINDOWS-1252 -t ISO8859-1 but I still got an error message indicating
> that there was a different problematic character ("illegal input
> sequence at position 41"). This is driving me insane.
You'd have to open the file with either a good text editor (like jEdit
or emacs) which tells you which character (byte) you're at (so you can
tell when you get to the 41st byte) and which allows you to ask for the
code point of the offending character; or else a hex editor. My guess
is that something went wrong in the conversion from UTF-8 to 1252, and
there's an illegal code point. Usually Word warns you about that
problem, but I guess it didn't in this case.
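If no suitable editor is at hand, a short script can do the same lookup. The following is only a sketch, assuming the file really is in a single-byte encoding such as Windows-1252; the function name and defaults are invented for this example. It lists every byte that either has no meaning in the source encoding or maps to a character the target encoding lacks, with zero-based byte offsets (which is how I read iconv's "position 0" for the BOM).

```python
def find_unconvertible(data: bytes, source: str = "cp1252",
                       target: str = "iso-8859-1"):
    """List bytes that cannot be carried over from source to target.

    Only valid for single-byte source encodings. Each entry is
    (byte_offset, byte_value, character), where character is None if the
    byte is undefined in the source encoding.
    """
    problems = []
    for offset, byte in enumerate(data):
        try:
            char = bytes([byte]).decode(source)
        except UnicodeDecodeError:
            # The byte has no assigned character in the source encoding.
            problems.append((offset, byte, None))
            continue
        try:
            char.encode(target)
        except UnicodeEncodeError:
            # Valid in the source encoding, but absent from the target.
            problems.append((offset, byte, char))
    return problems


if __name__ == "__main__":
    sample = b"He said \x93hi\x94 for \x80 5"
    for offset, byte, char in find_unconvertible(sample):
        print(offset, hex(byte), repr(char))
```

Windows-1252 assigns printable characters such as curly quotes, dashes and the euro sign to bytes in the 0x80-0x9F range, none of which exist in ISO-8859-1, so those are the usual suspects in text saved from Word.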
--
Mike Maxwell
What good is a universe without somebody around to look at it?
--Robert Dicke, Princeton physicist
_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora