[Corpora-List] Character encoding headaches

Lluís Padró padro at lsi.upc.edu
Mon Aug 3 15:06:30 UTC 2009


 Thank you for leaning on us.
 UTF-8 is certainly on our roadmap, and has recently moved up a few positions.

 Nevertheless, it will still take a pretty long time. So, if you are 
waiting for it to do something important, you should probably stop 
waiting and use iconv in the meantime... ;)
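
 For readers who prefer to script this workaround, here is a minimal Python sketch of the same round trip: decode UTF-8, approximate whatever ISO-8859-1 cannot represent (roughly what iconv's //TRANSLIT does), run a Latin-1-only tool, and convert the result back to UTF-8. The names FALLBACKS, to_latin1 and via_latin1 are hypothetical, the fallback table is an assumption rather than iconv's actual transliteration rules, and `tool` is a stand-in for any bytes-in, bytes-out analyzer. It is not FreeLing's API.

```python
import unicodedata

# Hypothetical fallback table for characters that word processors commonly
# insert and that have no ISO-8859-1 code point. Extend it for your own texts.
FALLBACKS = {
    "\u2018": "'", "\u2019": "'",   # curly single quotes
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u2013": "-", "\u2014": "-",   # en dash, em dash
    "\u2026": "...",                # horizontal ellipsis
    "\u20ac": "EUR",                # euro sign
}


def _fits_latin1(s: str) -> bool:
    """Return True if every character of s has an ISO-8859-1 code point."""
    try:
        s.encode("latin-1")
        return True
    except UnicodeEncodeError:
        return False


def to_latin1(text: str) -> bytes:
    """Encode text as ISO-8859-1, approximating characters that do not fit."""
    out = []
    for ch in text:
        if _fits_latin1(ch):
            out.append(ch)
        elif ch in FALLBACKS:
            out.append(FALLBACKS[ch])
        else:
            # Decompose the character and drop combining marks (accents).
            decomposed = unicodedata.normalize("NFKD", ch)
            base = "".join(c for c in decomposed if not unicodedata.combining(c))
            # If even the base form does not fit, fall back to "?".
            out.append(base if _fits_latin1(base) else "?")
    return "".join(out).encode("latin-1")


def via_latin1(utf8_bytes: bytes, tool) -> bytes:
    """Run a Latin-1-only tool on UTF-8 input and return UTF-8 output.

    `tool` is a hypothetical callable taking and returning Latin-1 bytes.
    """
    latin1_in = to_latin1(utf8_bytes.decode("utf-8"))
    latin1_out = tool(latin1_in)
    return latin1_out.decode("latin-1").encode("utf-8")
```

 Characters that were approximated on the way in are not restored on the way out, so the round trip is lossy for anything outside ISO-8859-1. For Spanish and Catalan text that is usually limited to typographic quotes and dashes.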

       best,

          Lluis Padro


Josep M. Fontana wrote:
> Sorry. My information might not be totally accurate. I haven't talked 
> to Lluís directly about this but somebody who works with him (not on 
> the development of Freeling, though) told me that they (Lluís and --if 
> any -- associates) would work on making it compatible with UTF8 in 
> future versions. I realize in my message I said "they are already 
> working on making it compatible with UTF-8", so this might be 
> inaccurate. My apologies for being misleading. It was not my 
> intention. I was simply reacting to the message that said to "lean 
> hard on the Freeling folks" and what I meant to say was really that 
> this is in their road map. I should have been more precise.
>
> Josep M.
>> On Mon, 03 Aug 2009 at 12:51 +0200, Josep M. Fontana wrote:
>>  
>>> Thanks a lot to everybody that responded. Problem solved!
>>>
>>> In the end, the simplest and quickest solution for me was to use the 
>>> //TRANSLIT keyword, as Lars Nygaard suggested. That might not work 
>>> with other kinds of texts, but for the Spanish and Catalan texts I'm 
>>> working with, I guess finding alternative characters that 
>>> approximate the problematic characters in the original document was 
>>> not too difficult for iconv.
>>>
>>> In response to Ciarán: if Word saves as ISO-8859-1 by default, it is 
>>> strange that the 'file' command does not recognize this encoding. 
>>> For most of the documents I saved from within Word, 'file' reports 
>>> "Non-ISO extended-ASCII text, with CRLF line terminators".
>>>
>>> With respect to Freeling, I'm told that they are already working on 
>>> making it compatible with UTF-8.
>>>     
>>
>> Really ? The last I heard from Lluís was:
>> "  No, it does not work with Unicode (basically because STL strings
>> do not support Unicode yet).  To process UTF texts, what it does is
>> convert them to Latin, analyze them, and convert the result back to UTF.  "
>>
>> I would love to hear that FreeLing will be supporting UTF-8!!
>>
>> Fran
>>
>>
>>
>>   
>
>
> _______________________________________________
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora

