[Corpora-List] Character encoding headaches
Ciarán Ó Duibhín
ciaran at oduibhin.freeserve.co.uk
Thu Aug 6 22:41:51 UTC 2009
I wrote earlier:
<< In my experience, it is difficult to get Word to output proper Windows
1252: even when it claims to be doing so, you end up with ISO8859-1
instead... To explain about Windows 1252 and ISO8859-1: the only difference
is that 1252 has a few extra characters, things like left and right
quotation marks and long dashes. These are encoded with bytes which are
unused in ISO8859-1. The rest of the encoding is the same. Saving as text
from Word actually converts the few extra characters of 1252 to the
"nearest" ones of ISO8859-1 (straight quotation marks, hyphens).>>
This misbehaviour of "save as text" occurs in Word 2000 but, following a
further exchange of information with Josep Fontana, I have learned that it
appears to be fixed in Word 2003.
That is, in Word 2003, I understand that "save as encoded text: West
European (Windows 1252)" retains these characters, while "save as encoded
text: West European (ISO 8859-1)" converts them; and simple "save as text"
retains them also, all of which is very sensible.
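For anyone wanting to check which behaviour their own copy of Word shows, here is a minimal Python sketch (not something I used at the time) that looks for the extra 1252 characters in a saved file. It assumes the extra characters sit in the byte range 0x80-0x9F, which is where Windows 1252 places them; the function name and the sample strings are my own, for illustration only.

```python
# Bytes 0x80-0x9F hold printable characters in Windows-1252 but not in ISO 8859-1.
CP1252_ONLY = range(0x80, 0xA0)

def cp1252_only_chars(data: bytes) -> list:
    """Return the characters in data that exist in Windows-1252 but not ISO 8859-1."""
    found = []
    for b in data:
        if b in CP1252_ONLY:
            # errors="replace" covers the few bytes Windows-1252 leaves undefined
            found.append(bytes([b]).decode("cp1252", errors="replace"))
    return found

# Illustrative data: curly quotes and an en dash, as a correct 1252 export would keep them
sample = "\u201cWord\u201d \u2013 a test".encode("cp1252")
retained = cp1252_only_chars(sample)

# The same text after conversion to the "nearest" characters: nothing left to find
converted = cp1252_only_chars(b'"Word" - a test')
```

To test a real export, read the file in binary mode and pass the bytes to `cp1252_only_chars`. An empty result for a document known to contain curly quotes or long dashes suggests the characters were converted on saving.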
In Word 2000, all these options convert the characters. A workaround is to
copy the whole text from Word 2000 to the clipboard, and paste it into a
Unicode-capable plain text editor that can save it as CP1252 (without
turning dashes into hyphens etc). UltraEdit is one such editor.
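The conversion step of the workaround can also be scripted. The following is only a sketch in Python, assuming the pasted text is already available as a Unicode string; the function names are hypothetical, and the substitution table is my guess at the kind of "nearest" mapping involved, not Word's actual table.

```python
# Hypothetical "nearest character" table, for contrast only; not Word's real mapping.
NEAREST = {
    "\u2018": "'", "\u2019": "'",   # single curly quotes -> apostrophe
    "\u201c": '"', "\u201d": '"',   # double curly quotes -> straight quote
    "\u2013": "-", "\u2014": "-",   # en and em dashes -> hyphen
}

def to_cp1252(text: str) -> bytes:
    """Encode as Windows-1252, keeping curly quotes and dashes.

    Raises UnicodeEncodeError if the text has characters outside Windows-1252.
    """
    return text.encode("cp1252")

def to_latin1_nearest(text: str) -> bytes:
    """Replace the extra 1252 characters with plain ones, then encode as ISO 8859-1."""
    return text.translate(str.maketrans(NEAREST)).encode("latin-1")

pasted = "\u201cCiar\u00e1n\u201d \u2014 text"   # stands in for text pasted from Word
kept = to_cp1252(pasted)                # quotes and dash survive as 0x93, 0x94, 0x97
flattened = to_latin1_nearest(pasted)   # quotes and dash become '"' and '-'
```

Writing `kept` to a file opened in binary mode gives the proper CP1252 output; `flattened` shows, by contrast, the kind of result the Word 2000 options produce.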
Ciarán Ó Duibhín.
_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora