[Corpora-List] Character encoding headaches
Ciarán Ó Duibhín
ciaran at oduibhin.freeserve.co.uk
Mon Aug 3 01:40:54 UTC 2009
Josep M. Fontana wrote:
<< After this, I tried a different path. I have access to some Windows
machines so I saved as text directly from within Word choosing the Windows
text format. Supposedly, this is /Windows/-1252 encoding and I assumed this
option was the safest since this is the default text format for Word (at
least with the installation I used). Then I ran iconv -f WINDOWS-1252 -t
ISO8859-1 but I still got an error message indicating that there was a
different problematic character ("illegal input sequence at position 41").>>
Have a look at the "Windows text" as output from Word, and check that it is
not already in ISO8859-1. In my experience, it is difficult to get Word to
output proper Windows 1252, even when it claims to be doing so; you end up
with ISO8859-1 instead. This is a nuisance for me, but it may be just
what you want.
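One rough way to make this check is sketched below in Python. It relies on the fact that the characters found in Windows 1252 but not in ISO8859-1 are stored as bytes in the range 0x80-0x9F, so a file with no bytes in that range reads identically under both encodings. The function name and the sample data are illustrative assumptions, not anything taken from Word or iconv.

```python
from typing import List, Tuple


def find_cp1252_only_bytes(data: bytes) -> List[Tuple[int, int]]:
    """Return (offset, byte value) for every byte in the range 0x80-0x9F.

    An empty result means the data decodes to the same text as
    Windows 1252 and as ISO8859-1.
    """
    return [(i, b) for i, b in enumerate(data) if 0x80 <= b <= 0x9F]


# Hypothetical sample: "café" in ISO8859-1 followed by byte 0x94,
# which is a right double quotation mark in Windows 1252.
sample = "café".encode("latin-1") + b"\x94"
hits = find_cp1252_only_bytes(sample)
for offset, value in hits:
    print(f"offset {offset}: byte 0x{value:02X}")
```

For a real file, read it in binary mode (open(path, "rb").read()) and pass the bytes to the function; the offsets it reports should help locate whatever character a converter is complaining about.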
You may still expect your output to contain Windows-style linebreaks
(two bytes, <CR> then <LF>), but it should not be difficult to reduce these
to <LF> only for Linux.
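A minimal sketch of that linebreak reduction in Python follows. It works on raw bytes, so it does not need to know whether the file is Windows 1252 or ISO8859-1. The function names and the file paths passed to them are hypothetical.

```python
def crlf_to_lf(data: bytes) -> bytes:
    """Replace every <CR><LF> pair with a single <LF>."""
    return data.replace(b"\r\n", b"\n")


def convert_file(src_path: str, dst_path: str) -> None:
    """Read src_path as bytes, fix the linebreaks, write to dst_path."""
    with open(src_path, "rb") as src:
        data = src.read()
    with open(dst_path, "wb") as dst:
        dst.write(crlf_to_lf(data))
```

A lone <CR> that is not followed by <LF> is left untouched, which seems the safer choice when the input is not fully trusted.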
[To explain about Windows 1252 and ISO8859-1: the only difference is that
1252 has a few extra characters, things like left and right quotation marks
and long dashes. These are encoded with bytes in the range 0x80-0x9F, which
ISO8859-1 reserves for control codes and does not use for printable
characters. The rest of the encoding is the same. Saving as text from Word
actually converts the few extra characters of 1252 to the "nearest" ones of
ISO8859-1 (straight quotation marks, hyphens).]
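If you would rather do that "nearest character" substitution yourself, something like the following Python sketch may serve. The mapping table is an assumption chosen for illustration and covers only quotation marks and dashes; it is not claimed to be the table Word uses. Any other character with no ISO8859-1 equivalent (the euro sign, for instance) comes out as "?".

```python
# Assumed "nearest" ISO8859-1 replacements for some Windows 1252 extras.
NEAREST = {
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
    "\u2013": "-",   # en dash
    "\u2014": "-",   # em dash
}


def cp1252_to_latin1(data: bytes) -> bytes:
    """Decode Windows 1252 bytes and re-encode them as ISO8859-1."""
    # Raises UnicodeDecodeError on the few bytes Windows 1252 leaves undefined.
    text = data.decode("cp1252")
    text = text.translate(str.maketrans(NEAREST))
    # Characters still outside ISO8859-1 are replaced with "?".
    return text.encode("latin-1", errors="replace")
```

For example, the bytes 0x93 and 0x94 (curly double quotation marks in Windows 1252) both become the straight quotation mark 0x22, while accented letters such as 0xE9 pass through unchanged.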
Ciarán Ó Duibhín.
_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora