[Corpora-List] Character encoding headaches

Ciarán Ó Duibhín ciaran at oduibhin.freeserve.co.uk
Thu Aug 6 22:41:51 UTC 2009


I wrote earlier:
<< In my experience, it is difficult to get Word to output proper Windows 
1252: even when it claims to be doing so, you end up with ISO 8859-1 
instead...  To explain about Windows 1252 and ISO 8859-1: the only 
difference is that 1252 has a few extra characters, things like left and 
right quotation marks and long dashes.  These are encoded with bytes (in 
the range 0x80-0x9F) which hold no printable characters in ISO 8859-1.  
The rest of the encoding is the same.  Saving as text from Word actually 
converts the few extra characters of 1252 to the "nearest" ones of 
ISO 8859-1 (straight quotation marks, hyphens).>>

This misbehaviour of "save as text" occurs in Word 2000 but, following a 
further exchange of information with Josep Fontana, I have learned that it 
appears to be fixed in Word 2003.

That is, in Word 2003, I understand that "save as encoded text: West 
European (Windows 1252)" retains these characters, while "save as encoded 
text: West European (ISO 8859-1)" converts them. Plain "save as text" also 
retains them. All of this is very sensible.
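
For anyone who wants to check which behaviour a given saved file shows, 
here is a minimal Python sketch. It assumes the file's bytes have already 
been read into memory. The names SAMPLE, NEAREST and has_cp1252_extras are 
mine, for illustration only, and the substitution table merely imitates 
the "nearest character" conversion described above; it is not Word's actual 
table.

```python
# Curly quotes and dashes: present in Windows-1252, absent from ISO 8859-1.
SAMPLE = "\u201cquoted\u201d \u2013 and \u2014 dashes"

# Illustrative "nearest character" table (an assumption, not Word's own).
NEAREST = str.maketrans({
    "\u2018": "'", "\u2019": "'",   # single curly quotes -> apostrophe
    "\u201c": '"', "\u201d": '"',   # double curly quotes -> straight quote
    "\u2013": "-", "\u2014": "-",   # en and em dash -> hyphen
})


def has_cp1252_extras(data: bytes) -> bool:
    """True if any byte lies in 0x80-0x9F, where Windows-1252 keeps its
    extra printable characters."""
    return any(0x80 <= b <= 0x9F for b in data)


# What a save that retains the characters would write.
retained = SAMPLE.encode("cp1252")

# What a save that converts the characters would write.
converted = SAMPLE.translate(NEAREST).encode("latin-1")

print(has_cp1252_extras(retained))   # True
print(has_cp1252_extras(converted))  # False
```

A file for which this returns False either never contained such characters 
or has had them converted, so the check is only informative when the 
original document is known to contain curly quotes or dashes.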

In Word 2000, all these options convert the characters.  A workaround is to 
copy the whole text from Word 2000 to the clipboard, and paste it into a 
Unicode-capable plain text editor which has the facility to convert it to 
CP1252 (without turning dashes into hyphens, etc.). UltraEdit is one such 
editor.
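
The same conversion can be sketched in a few lines of Python, as an 
alternative to an editor. This assumes the pasted text has first been saved 
as UTF-8 (the src_encoding default below is that assumption); the function 
name reencode_to_cp1252 is hypothetical.

```python
def reencode_to_cp1252(data: bytes, src_encoding: str = "utf-8") -> bytes:
    """Decode bytes from src_encoding and re-encode them as Windows-1252.

    Encoding is strict: a character with no Windows-1252 equivalent raises
    UnicodeEncodeError instead of being silently replaced, so dashes and
    curly quotes are either kept exactly or the problem is reported.
    """
    text = data.decode(src_encoding)
    return text.encode("cp1252")


utf8_bytes = "\u201ca\u2014b\u201d".encode("utf-8")
cp1252_bytes = reencode_to_cp1252(utf8_bytes)
print(cp1252_bytes)  # b'\x93a\x97b\x94'
```

When writing the result to disk, open the output file in binary mode 
("wb") so that the bytes are stored unchanged.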

Ciarán Ó Duibhín.


