[Corpora-List] Character encoding headaches

Mike Maxwell maxwell at umiacs.umd.edu
Sun Aug 2 22:59:47 UTC 2009


Josep M. Fontana wrote:
> I'm sure many people in the list have experienced the curse of character 
> encoding when building corpora 

You don't know how bad it can get until you have to deal with Indic 
encodings :-(

> These documents are giving me a lot of headaches. What I did was to open 
> them with OpenOffice Writer (as I'm working with Linux) and save them as 
> text. Running the 'file' command in Linux, I see that the format of the 
> text file is "UTF-8 Unicode (with BOM) text".
>
> Since the tools I'm using (Freeling from 
> http://garraf.epsevg.upc.es/freeling/) only work well when the text is 
> ISO-8859-1, I ran 'iconv' (iconv -f UTF8 -t ISO8859-1 <  ...etc.) but I 
> get the following error message: "illegal input sequence at position 0".

The BOM will be at position 0.  If there's a way to save the files 
without the BOM (which is AFAIK irrelevant for UTF-8 anyway), that will 
probably solve this problem.  Alternatively, you could write a snippet 
of code that would cut out the BOM; I believe, but am not sure, that it 
will be the first three bytes.  You can check with a hex editor.
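
For what it's worth, here is a minimal sketch in Python of the "cut out
the BOM" idea, followed by the conversion to ISO-8859-1. The function
names are made up for illustration, and it assumes the whole file fits
in memory. codecs.BOM_UTF8 is the standard library's constant for the
UTF-8 BOM (the bytes EF BB BF).

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM, if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


def utf8_to_latin1(data: bytes) -> bytes:
    """Drop the BOM, decode as UTF-8, and re-encode as ISO-8859-1.

    Raises UnicodeEncodeError if the text contains a character that
    ISO-8859-1 cannot represent.
    """
    text = strip_utf8_bom(data).decode("utf-8")
    return text.encode("iso-8859-1")
```

To use it, read the file in binary mode, pass the bytes to
utf8_to_latin1(), and write the result back out in binary mode. Python
also ships a "utf-8-sig" codec that skips a leading BOM when decoding,
which may be simpler if no separate stripping step is needed.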

> After this, I tried a different path. I have access to some Windows 
> machines so I saved as text directly from within Word choosing the 
> Windows text format. Supposedly, this is Windows-1252 encoding and I 
> assumed this option was the safest since this is the default text format 
> for Word (at least with the installation I used). Then I ran iconv -f  
> WINDOWS-1252 -t ISO8859-1 but I still got an error message indicating 
> that there was a different problematic character ("illegal input 
> sequence at position 41"). This is driving me insane.

You'd have to open the file either with a good text editor (like jEdit 
or emacs) that tells you which character (byte) you're at (so you can 
tell when you get to the 41st byte) and lets you ask for the code point 
of the offending character, or else with a hex editor.  My guess 
is that something went wrong in the conversion from UTF-8 to 1252, and 
there's an illegal code point.  Usually Word warns you about that 
problem, but I guess it didn't in this case.
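
If no suitable editor is at hand, the same check can be sketched in
Python. This is only an illustration under some assumptions: the
function names are hypothetical, the file really is Windows-1252
("cp1252" in Python), and iconv counts positions from 0 (it reported the
BOM at position 0, which suggests so). One common cause, though only a
guess in this case, is a character such as a curly quote, which
Windows-1252 has and ISO-8859-1 lacks.

```python
import unicodedata


def byte_at(data: bytes, offset: int) -> str:
    """Return the byte at a zero-based offset, formatted as hex."""
    return f"0x{data[offset]:02X}"


def unencodable_chars(data: bytes, source: str = "cp1252",
                      target: str = "iso-8859-1"):
    """List (offset, code point, name) for characters the target lacks.

    For a single-byte source encoding such as cp1252, the character
    index equals the byte offset. Decoding raises UnicodeDecodeError if
    the data contains a byte that the source encoding leaves undefined.
    """
    text = data.decode(source)
    problems = []
    for index, ch in enumerate(text):
        try:
            ch.encode(target)
        except UnicodeEncodeError:
            name = unicodedata.name(ch, "UNKNOWN")
            problems.append((index, f"U+{ord(ch):04X}", name))
    return problems
```

Reading the file in binary mode and calling byte_at(data, 41) shows the
byte that iconv complained about, and unencodable_chars(data) lists
every character that would need replacing before the conversion can
succeed.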
-- 
    Mike Maxwell
    What good is a universe without somebody around to look at it?
    --Robert Dicke, Princeton physicist
