[Corpora-List] Character encoding headaches

Josep M. Fontana josepm.fontana at upf.edu
Mon Aug 3 10:51:35 UTC 2009


Thanks a lot to everybody who responded. Problem solved!

In the end, the simplest and quickest solution for me was to use the 
//TRANSLIT keyword, as Lars Nygaard suggested. That might not work with 
other kinds of texts, but for the Spanish and Catalan texts I'm working 
with, I guess it was not too difficult for iconv to find alternative 
characters that approximate the problematic characters in the original 
documents.
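
For anyone who wants the same kind of fallback without iconv, here is a 
rough Python sketch of the idea behind //TRANSLIT: when a character has 
no equivalent in the target encoding, substitute an approximation instead 
of failing. The target encoding (Latin-1), the FALLBACKS table and the 
handler name "translit_sketch" are my own assumptions for illustration; 
this is not iconv's actual transliteration table.

```python
import codecs
import unicodedata

# Hand-picked approximations for characters that Latin-1 lacks.
# Illustrative only: this is NOT the table iconv uses for //TRANSLIT.
FALLBACKS = {
    "\u2018": "'", "\u2019": "'",   # curly single quotes
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u2013": "-", "\u2014": "-",   # en dash, em dash
    "\u2026": "...",                # horizontal ellipsis
    "\u20ac": "EUR",                # euro sign
}


def translit_error(err):
    """Codec error handler: approximate characters the target cannot encode."""
    if not isinstance(err, UnicodeEncodeError):
        raise err
    pieces = []
    for ch in err.object[err.start:err.end]:
        if ch in FALLBACKS:
            pieces.append(FALLBACKS[ch])
            continue
        # Decompose the character and drop combining marks (accents).
        decomposed = unicodedata.normalize("NFKD", ch)
        approx = "".join(c for c in decomposed if not unicodedata.combining(c))
        # Anything still outside Latin-1 becomes "?".
        pieces.append(approx.encode("latin-1", "replace").decode("latin-1"))
    return ("".join(pieces), err.end)


# Hypothetical handler name, registered once at import time.
codecs.register_error("translit_sketch", translit_error)


def to_latin1(text):
    """Encode text as Latin-1, approximating unencodable characters."""
    return text.encode("latin-1", errors="translit_sketch")
```

Characters that Latin-1 already covers (ñ, ç, accented vowels) pass 
through unchanged, which is probably why this kind of approach is 
painless for Spanish and Catalan; only typographic extras such as curly 
quotes and dashes get approximated.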

In response to Ciarán: if Word saves as ISO-8859-1 by default, what is 
strange is that the 'file' command does not recognize this encoding. For 
most of the documents I'm using that were saved from within Word, 'file' 
reports "Non-ISO extended-ASCII text, with CRLF line terminators".
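
One possible explanation, which I have not verified against these 
particular files, is that Word on Windows actually writes Windows-1252 
rather than ISO-8859-1. The two agree except for bytes 0x80-0x9F, which 
are control codes in ISO-8859-1 but printable characters (curly quotes, 
dashes, the euro sign) in Windows-1252, and as far as I know 'file' 
reports "Non-ISO extended-ASCII" when it sees bytes in that range. The 
Python sketch below applies the same heuristic; the function name is my 
own, and the result is a guess, not a reliable detection.

```python
def guess_single_byte_encoding(data: bytes) -> str:
    """Heuristic guess at the encoding of raw text bytes (hypothetical helper)."""
    # Pure 7-bit data is plain ASCII.
    if all(b < 0x80 for b in data):
        return "ascii"
    # Data that decodes cleanly as UTF-8 is most likely UTF-8.
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        pass
    # Bytes 0x80-0x9F are control codes in ISO-8859-1 but printable
    # characters in Windows-1252, so their presence suggests the latter.
    if any(0x80 <= b <= 0x9F for b in data):
        return "cp1252"
    return "iso-8859-1"
```

If a file comes out as "cp1252", telling iconv to read it as 
WINDOWS-1252 instead of ISO-8859-1 would be the thing to try.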

With respect to Freeling, I'm told that they are already working on 
making it compatible with UTF-8.

Again, thanks to all of you for your time. You cannot imagine (or maybe 
you can :-)) what a relief it is to have gotten out of this encoding 
nightmare.

JM

_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora


