[Corpora-List] Character encoding headaches

Mon Aug 3 23:44:48 UTC 2009

"Josep M. Fontana" <josepm.fontana at upf.edu> wrote:
[...]
> These documents are giving me a lot of headaches. What I did was to open
> them with OpenOffice Writer (as I'm working with Linux) and save them as
> text. Running the 'file' command in Linux, I see that the format of the
> text file is "UTF-8 Unicode (with BOM) text".

Shame on it. A BOM is unnecessary for UTF-8 (as the byte order is always
the same), and the Unicode Standard neither requires nor recommends one
there, although it does permit it. I know a lot of Microsoft systems
include it in UTF-8 files, but I hadn't heard that OpenOffice did too.

> Since the tools I'm using (Freeling from
> http://garraf.epsevg.upc.es/freeling/) only work well when the text is
> ISO-8859-1, I ran 'iconv' (iconv -f UTF8 -t ISO8859-1 <  ...etc.) but I
> get the following error message: "illegal input sequence at position 0".

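Yes, iconv is strict. In UTF-8 the BOM is three bytes (EF BB BF, the
encoding of U+FEFF; FE FF and FF FE are its UTF-16 forms). That sequence is
well-formed UTF-8, but U+FEFF has no equivalent in ISO-8859-1, which is
most likely why iconv refuses to go on at position 0.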

Since you are using Linux, there is a simple solution - use "vim" in a
terminal window to remove the BOM, e.g.

vim filename
:set nobomb   (to stop vim writing the BOM back out)
:wq           (to save the modified file and exit)

iconv should work OK now.
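
If there are many files, a non-interactive version may be handier. The
following is only a minimal sketch: it assumes GNU sed (the \xHH escapes are
a GNU extension), and the function name strip_bom_to_latin1 is made up for
illustration. It drops a leading EF BB BF if one is present and pipes the
rest through iconv.

```shell
# strip_bom_to_latin1 is a hypothetical helper name, not part of any tool.
# Usage: strip_bom_to_latin1 INPUT OUTPUT
strip_bom_to_latin1() {
    # Remove the UTF-8 BOM (bytes EF BB BF) from the start of line 1, if any.
    # LC_ALL=C makes sed treat the input as raw bytes rather than characters.
    # The remaining text is then converted from UTF-8 to ISO-8859-1.
    LC_ALL=C sed '1s/^\xEF\xBB\xBF//' "$1" | iconv -f UTF-8 -t ISO-8859-1 > "$2"
}
```

It would be called as "strip_bom_to_latin1 in.txt out.txt". Files without a
BOM pass through sed unchanged, and iconv will still stop on any other
character that has no ISO-8859-1 equivalent.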

Jim

--
Jim Breen
Adjunct Snr Research Fellow, Clayton School of IT, Monash University
Treasurer: Hawthorn Rowing Club, VCA Secondary School, Japanese Studies Centre
Graduate student: Language Technology Group, University of Melbourne
