[Corpora-List] Character encoding headaches
Josep M. Fontana
josepm.fontana at upf.edu
Mon Aug 3 10:51:35 UTC 2009
Thanks a lot to everybody who responded. Problem solved!
In the end, the simplest and quickest solution for me was to use the
//TRANSLIT suffix with iconv, as Lars Nygaard suggested. That might not
work with other kinds of texts, but for the Spanish and Catalan texts
I'm working with, I guess finding alternative characters that
approximate the problematic characters in the original document was not
too difficult for iconv.
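
For anyone who wants the same kind of approximation without iconv, here
is a minimal Python sketch. It is not what iconv does internally, and
Python's codecs have no //TRANSLIT option; this swaps in a small
hand-written fallback table plus Unicode decomposition. The target
encoding (ISO-8859-1), the FALLBACKS table and the function name
to_latin1_approx are all assumptions made for illustration.

```python
import unicodedata

# Hypothetical fallback table: a few characters that word processors
# commonly insert and that have no ISO-8859-1 code point.
FALLBACKS = {
    "\u2018": "'", "\u2019": "'",    # curly single quotes
    "\u201c": '"', "\u201d": '"',    # curly double quotes
    "\u2013": "-", "\u2014": "-",    # en dash, em dash
    "\u2026": "...",                 # horizontal ellipsis
    "\u20ac": "EUR",                 # euro sign
}


def to_latin1_approx(text):
    """Encode text as ISO-8859-1, approximating unencodable characters."""
    # Compose first so that "e" + combining accent becomes a single letter.
    text = unicodedata.normalize("NFC", text)
    out = []
    for ch in text:
        try:
            ch.encode("iso-8859-1")
            out.append(ch)           # already representable, keep as-is
            continue
        except UnicodeEncodeError:
            pass
        if ch in FALLBACKS:
            out.append(FALLBACKS[ch])
            continue
        # Last resort: decompose and drop combining marks (e.g. U+0151 -> "o").
        base = "".join(
            c for c in unicodedata.normalize("NFKD", ch)
            if not unicodedata.combining(c)
        )
        try:
            base.encode("iso-8859-1")
            out.append(base)
        except UnicodeEncodeError:
            out.append("?")          # nothing suitable found
    return "".join(out).encode("iso-8859-1")
```

Letters such as à, ñ and ç exist in ISO-8859-1 and pass through
unchanged, which is probably why Spanish and Catalan texts lose so
little; only the typographic extras get replaced.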
In response to Ciarán: if Word saves as ISO-8859-1 by default, what is
strange is that the 'file' command does not recognize this encoding.
For most of the documents I saved from within Word, 'file' reports
"Non-ISO extended-ASCII text, with CRLF line terminators".
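
One likely explanation, which is an assumption and not something
confirmed in this thread: the files are actually in Windows-1252, which
places printable characters such as curly quotes in the byte range
0x80-0x9F. ISO-8859-1 reserves that range for control codes, so 'file'
does not treat such data as ISO-8859 text. The Python sketch below
checks for those bytes. The function names are made up for
illustration, and the result is only a heuristic guess, not a reliable
encoding detector.

```python
def c1_bytes(data):
    """Return the set of byte values in 0x80-0x9F that occur in data."""
    return {b for b in data if 0x80 <= b <= 0x9F}


def guess_single_byte_encoding(data):
    """Rough guess among ASCII, UTF-8, Windows-1252 and ISO-8859-1."""
    if all(b < 0x80 for b in data):
        return "ascii"
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        pass
    # Bytes in 0x80-0x9F are printable in Windows-1252 but are control
    # codes in ISO-8859-1, so their presence points to Windows-1252.
    return "cp1252" if c1_bytes(data) else "iso-8859-1"
```

To try it on a real document, read the file in binary mode and pass the
bytes in, for example guess_single_byte_encoding(open(path, "rb").read())
where path is whatever file you want to check.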
With respect to Freeling, I'm told that they are already working on
making it compatible with UTF-8.
Again, thanks to all of you for your time. You cannot imagine (or maybe
you can :-)) what a relief it is to have gotten out of this encoding
nightmare.
JM