[Corpora-List] Character encoding headaches

Jim Breen jimbreen at gmail.com
Tue Aug 4 23:59:26 UTC 2009


Torsten Marek <marek at ifi.uzh.ch> wrote:
> Am Dienstag, den 04.08.2009, 09:44 +1000 schrieb Jim Breen:
>> "Josep M. Fontana" <josepm.fontana at upf.edu> wrote:
>> [...]
>> > These documents are giving me a lot of headaches. What I did was to open
>> > them with OpenOffice Writer (as I'm working with Linux) and save them as
>> > text. Running the 'file' command in Linux, I see that the format of the
>> > text file is "UTF-8 Unicode (with BOM) text".
>>
>> Shame on it. A BOM is not only unnecessary for UTF-8 (as the byte-order
>> is always the same), but is technically illegal in UTF-8 encapsulation of
>> Unicode/ISO-10646. I know a lot of Microsoft systems include it in UTF-8
>> files, but I hadn't heard that OpenOffice did too.
>
> if a file is detected to be of type "UTF-8 Unicode (with BOM) text", it
> simply means that it is a UTF-8 file starting with the byte sequence EF
> BB BF. This sequence represents the character U+FEFF, the ZERO WIDTH
> NO-BREAK SPACE, which is a valid Unicode character and not illegal. It
> also doesn't do any real harm because it doesn't show up on the screen.
>
> It is somewhat superfluous, but also serves as a good indicator for
> telling UTF-8 from pretty much every other character encoding,
> especially since ("normal") English text in UTF-8 is (designed to be)
> the same as ASCII.

There is nothing wrong with EFBBBF in a UTF-8 file, as it is a valid
UTF-8 construct. As you say, it is quite superfluous.
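
To make the point concrete, here is a minimal sketch (Python is my
assumption; nobody in the thread names a language) showing that EF BB BF
is simply U+FEFF encoded as UTF-8, and that Python's standard "utf-8-sig"
codec drops one leading U+FEFF on decoding while plain "utf-8" keeps it.
The sample string "hello" is made up for illustration.

```python
import codecs

# U+FEFF encoded as UTF-8 yields the three bytes EF BB BF.
bom_bytes = "\ufeff".encode("utf-8")
assert bom_bytes == b"\xef\xbb\xbf" == codecs.BOM_UTF8

# Hypothetical file content: the signature followed by ordinary text.
data = codecs.BOM_UTF8 + "hello".encode("utf-8")

# Plain "utf-8" keeps the character in the decoded string;
# "utf-8-sig" strips a single leading U+FEFF if one is present.
with_bom = data.decode("utf-8")
without_bom = data.decode("utf-8-sig")
```

Decoding with "utf-8-sig" is also safe for input that has no signature,
so it can be used as a default when reading files of unknown origin.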

> The only thing that would be illegal in UTF-8 is an initial FFFE/FEFF
> byte sequence, but this is never recognized as valid UTF-8 anyway (by
> file, at least).

And that is precisely the problem with many files which are allegedly in
UTF-8: they often start with FFFE. The common culprits are Microsoft
packages which, when asked to save a text file in UTF-8, seem to prepend
the illegal BOM as a matter of course. I frequently have to delete these
bytes before I can process such files, e.g. by running them through iconv.
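
The manual clean-up step could be sketched roughly as follows. This is
only an illustration under stated assumptions: Python as the language,
a hypothetical helper name strip_signature, and a deliberately short
signature table (UTF-32 is left out, although its little-endian BOM
also begins with FF FE). FF FE and FE FF are the UTF-16 byte-order
marks, so the sketch reports a UTF-16 guess for them; whether a given
file really is UTF-16 or just mislabeled has to be checked case by case.

```python
import codecs

# Leading byte signatures and the codec name guessed for the remaining bytes.
# UTF-32 signatures are intentionally omitted to keep the sketch short.
_SIGNATURES = [
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
]


def strip_signature(data: bytes):
    """Return (payload, encoding_guess).

    payload is data without a recognized leading signature;
    encoding_guess is None when no signature was found.
    """
    for signature, encoding in _SIGNATURES:
        if data.startswith(signature):
            return data[len(signature):], encoding
    return data, None
```

For example, strip_signature(b"\xff\xfeh\x00i\x00") returns the payload
b"h\x00i\x00" with the guess "utf-16-le", and decoding that payload with
the guessed codec gives "hi". Input without a signature comes back
unchanged with a guess of None.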

Cheers

Jim

-- 
Jim Breen
Adjunct Snr Research Fellow, Clayton School of IT, Monash University
Treasurer: Hawthorn Rowing Club, VCA Secondary School, Japanese Studies Centre
Graduate student: Language Technology Group, University of Melbourne

_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora


