[Corpora-List] Character encoding headaches

Torsten Marek marek at ifi.uzh.ch
Tue Aug 4 08:28:36 UTC 2009


On Tuesday, 04.08.2009, at 09:44 +1000, Jim Breen wrote:
> "Josep M. Fontana" <josepm.fontana at upf.edu> wrote:
> [...]
> > These documents are giving me a lot of headaches. What I did was to open
> > them with OpenOffice Writer (as I'm working with Linux) and save them as
> > text. Running the 'file' command in Linux, I see that the format of the
> > text file is "UTF-8 Unicode (with BOM) text".
> 
> Shame on it. A BOM is not only unnecessary for UTF-8 (as the byte-order
> is always the same), but is technically illegal in UTF-8 encapsulation of
> Unicode/ISO-10646. I know a lot of Microsoft systems include it in UTF-8
> files, but I hadn't heard that OpenOffice did too.

Hi,

if a file is detected to be of type "UTF-8 Unicode (with BOM) text", it
simply means that it is a UTF-8 file starting with the byte sequence EF
BB BF. This sequence represents the character U+FEFF, the ZERO WIDTH
NO-BREAK SPACE, which is a valid Unicode character and not illegal. It
also doesn't do any real harm because it doesn't show up on the screen. 
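
To make the byte-level claim concrete, here is a minimal Python sketch. It
checks that U+FEFF encodes to EF BB BF in UTF-8 and removes a leading BOM
if one is present. The helper name strip_utf8_bom is made up for this
illustration and is not part of any library.

```python
import codecs

# U+FEFF encoded as UTF-8 yields the three bytes EF BB BF.
bom_bytes = "\ufeff".encode("utf-8")
assert bom_bytes == b"\xef\xbb\xbf"
assert bom_bytes == codecs.BOM_UTF8


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM (hypothetical helper)."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


# A sample byte string as a BOM-writing editor might produce it.
sample = codecs.BOM_UTF8 + b"Hello"

print(strip_utf8_bom(sample))       # bytes with the BOM removed
print(sample.decode("utf-8-sig"))   # the "utf-8-sig" codec drops the BOM while decoding
```

Decoding with "utf-8-sig" is usually the simpler choice when reading such
files, since it accepts input both with and without the BOM.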

It is somewhat superfluous, but also serves as a good indicator for
telling UTF-8 from pretty much every other character encoding,
especially since ("normal") English text in UTF-8 is (designed to be)
the same as ASCII.
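
As a rough illustration of why the BOM is a useful indicator, the following
Python sketch shows that plain ASCII text produces identical bytes in ASCII
and UTF-8, so only the BOM distinguishes the two. The function name
has_utf8_bom is an assumption made for this example.

```python
import codecs

text = "plain English text"

# Pure ASCII text gives the same bytes in both encodings,
# so the bytes alone cannot tell ASCII from UTF-8.
assert text.encode("ascii") == text.encode("utf-8")


def has_utf8_bom(data: bytes) -> bool:
    """Report whether data starts with EF BB BF (hypothetical helper)."""
    return data.startswith(codecs.BOM_UTF8)


# The "utf-8-sig" codec prepends the BOM when encoding.
with_bom = text.encode("utf-8-sig")
without_bom = text.encode("utf-8")

print(has_utf8_bom(with_bom))
print(has_utf8_bom(without_bom))
```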

What would be illegal in UTF-8 is an initial FF FE or FE FF byte
sequence (the UTF-16 BOMs), because the bytes FE and FF never occur in
valid UTF-8. Such a file is never recognized as UTF-8 anyway (by file,
at least).
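
This can be checked with a small Python sketch, assuming a strict UTF-8
decode is an acceptable validity test. The function name is_valid_utf8 is
invented for this example.

```python
import codecs


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# FF FE and FE FF are the UTF-16 byte order marks.
utf16_le_start = codecs.BOM_UTF16_LE + b"abc"
utf16_be_start = codecs.BOM_UTF16_BE + b"abc"
# EF BB BF is U+FEFF encoded in UTF-8.
utf8_bom_start = codecs.BOM_UTF8 + b"abc"

print(is_valid_utf8(utf16_le_start))
print(is_valid_utf8(utf16_be_start))
print(is_valid_utf8(utf8_bom_start))
```

The first two checks fail because the bytes FE and FF cannot start or
continue any UTF-8 sequence, while the third succeeds.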

best,

Torsten

-- 
.: Torsten Marek
.: University of Zurich
.: Institute of Computational Linguistics
.: http://www.cl.uzh.ch/en/tmarek.html
