[Corpora-List] Character encoding headaches

Josep M. Fontana josepm.fontana at upf.edu
Mon Aug 3 12:22:41 UTC 2009


Sorry, my information might not be entirely accurate. I haven't talked to 
Lluís directly about this, but somebody who works with him (though not on 
the development of Freeling) told me that they (Lluís and his associates, 
if any) would work on making it compatible with UTF-8 in future versions. 
I realize that in my message I said "they are already working on making it 
compatible with UTF-8", so that might be inaccurate. My apologies for being 
misleading; it was not my intention. I was simply reacting to the message 
that said to "lean hard on the Freeling folks", and what I really meant to 
say was that this is on their road map. I should have been more precise.

Josep M.
> On Mon, 3 Aug 2009 at 12:51 +0200, Josep M. Fontana wrote:
>
>> Thanks a lot to everybody who responded. Problem solved!
>>
>> In the end, the simplest and quickest solution for me was to use the 
>> //TRANSLIT keyword, as Lars Nygaard suggested. That might not work with 
>> other kinds of texts, but for the Spanish and Catalan texts I'm working 
>> with, I guess finding alternative characters that approximate the 
>> problematic characters in the original document was not too difficult 
>> for iconv.
>>
>> In response to Ciarán: if Word saves as ISO-8859-1 by default, what is 
>> strange is that the 'file' command does not recognize this encoding. For 
>> most of the documents I'm using that were saved from within Word, 'file' 
>> reports "Non-ISO extended-ASCII text, with CRLF line terminators".
>>
>> With respect to Freeling, I'm told that they are already working on 
>> making it compatible with UTF-8.
>>     
>
> Really? The last I heard from Lluís was:
>
> "No, it doesn't work with Unicode (basically because the STL strings
> don't support Unicode yet). To process UTF texts, what it does is
> convert them to Latin, analyze them, and convert the result back to UTF."
>
> I would love to hear that FreeLing will be supporting UTF-8!!
>
> Fran


_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora


