[Corpora-List] Character encoding headaches

Ciarán Ó Duibhín ciaran at oduibhin.freeserve.co.uk
Mon Aug 3 01:40:54 UTC 2009


Josep M. Fontana wrote:
<< After this, I tried a different path. I have access to some Windows 
machines so I saved as text directly from within Word choosing the Windows 
text format. Supposedly, this is Windows-1252 encoding and I assumed this 
option was the safest since this is the default text format for Word (at 
least with the installation I used). Then I ran iconv -f WINDOWS-1252 -t 
ISO8859-1 but I still got an error message indicating that there was a 
different problematic character ("illegal input sequence at position 41").>>

Have a look at the "Windows text" as output from Word, and check that it is 
not already in ISO8859-1.  In my experience, it is difficult to get Word to 
output proper Windows 1252, even when it claims to be doing so; you end up 
with ISO8859-1 instead.  This is a nuisance for me, but it may be just what 
you want.
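
One way to make that check, as a rough sketch rather than a definitive test: 
the characters that 1252 adds all live in the byte range 0x80-0x9F, so a file 
with no bytes in that range reads the same under both encodings.  The function 
name and the sample strings below are made up for illustration.

```python
def find_cp1252_only_bytes(data: bytes) -> list[tuple[int, int]]:
    """Return (offset, byte value) pairs for bytes in the range 0x80-0x9F.

    An empty result means the data decodes identically as Windows-1252
    and as ISO8859-1, so no conversion is needed.
    """
    return [(i, b) for i, b in enumerate(data) if 0x80 <= b <= 0x9F]


# Illustrative samples (assumptions, not taken from the original files).
sample_latin1 = "café - naïve".encode("latin-1")
sample_cp1252 = "\u201ccafé\u201d \u2013 naïve".encode("cp1252")

print(find_cp1252_only_bytes(sample_latin1))
print([(i, hex(b)) for i, b in find_cp1252_only_bytes(sample_cp1252)])
```

For a real file, read it in binary mode (open(path, "rb").read()) and pass the 
bytes to the function; any offsets reported point at curly quotes, long dashes 
or similar 1252-only characters.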

You may still expect your output to contain Windows-style linebreaks 
(two bytes, <CR> then <LF>), but it should not be difficult to reduce these 
to <LF> only for Linux.
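
A minimal sketch of that linebreak reduction, assuming the file is in a 
single-byte encoding such as ISO8859-1 or Windows 1252 (so <CR> and <LF> are 
always the single bytes 0x0D and 0x0A).  The function name is hypothetical.

```python
def crlf_to_lf(data: bytes) -> bytes:
    """Replace every <CR><LF> pair with a single <LF>.

    Works on raw bytes, so the text encoding is left untouched.
    A lone <CR> that is not followed by <LF> is kept as it is.
    """
    return data.replace(b"\r\n", b"\n")


windows_text = b"first line\r\nsecond line\r\n"
print(crlf_to_lf(windows_text))
```

Reading and writing the file in binary mode ("rb" and "wb") avoids any 
automatic newline translation by Python itself.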

[To explain about Windows 1252 and ISO8859-1: the only difference is that 
1252 has a few extra characters, things like left and right quotation marks 
and long dashes.  These are encoded with bytes in the range 0x80-0x9F, which 
ISO8859-1 does not use for printable characters.  The rest of the encoding is 
the same.  Saving as text from Word actually converts the few extra 
characters of 1252 to the "nearest" ones of ISO8859-1 (straight quotation 
marks, hyphens).]
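
If a file does turn out to be genuine Windows 1252, the same kind of 
"nearest character" substitution can be done by hand before converting.  The 
sketch below is only an illustration: the mapping table is my own assumption 
covering quotation marks and dashes, not a record of what Word does, and the 
function name is hypothetical.

```python
# Assumed "nearest" ISO8859-1 equivalents for some 1252-only characters.
NEAREST = {
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
    "\u2013": "-",   # en dash
    "\u2014": "-",   # em dash
}


def cp1252_to_latin1(data: bytes) -> bytes:
    """Decode Windows-1252 bytes, substitute, and re-encode as ISO8859-1.

    Characters with no ISO8859-1 equivalent that are missing from NEAREST
    come out as "?" because of errors="replace".
    """
    text = data.decode("cp1252")
    for src, dst in NEAREST.items():
        text = text.replace(src, dst)
    return text.encode("latin-1", errors="replace")


print(cp1252_to_latin1(b"\x93caf\xe9\x94 \x96 x"))
```

Note that Python's cp1252 codec raises UnicodeDecodeError on the five byte 
values that 1252 leaves undefined (0x81, 0x8D, 0x8F, 0x90, 0x9D), which is 
itself a useful sign that the input is not really Windows 1252.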

Ciarán Ó Duibhín.


