[Corpora-List] Character encoding headaches
Josep M. Fontana
josepm.fontana at upf.edu
Mon Aug 3 12:22:41 UTC 2009
Sorry. My information might not be totally accurate. I haven't talked to
Lluís directly about this but somebody who works with him (not on the
development of Freeling, though) told me that they (Lluís and his
associates, if any) would work on making it compatible with UTF-8 in future
versions. I realize in my message I said "they are already working on
making it compatible with UTF-8", so this might be inaccurate. My
apologies for being misleading. It was not my intention. I was simply
reacting to the message that said to "lean hard on the Freeling folks"
and what I meant to say was really that this is in their road map. I
should have been more precise.
Josep M.
> On Mon, 03 Aug 2009 at 12:51 +0200, Josep M. Fontana wrote:
>
>> Thanks a lot to everybody that responded. Problem solved!
>>
>> In the end the simplest, quickest solution for me was to use the
>> //TRANSLIT keyword as Lars Nygaard suggested. That might not work with
>> other kinds of texts but for the Spanish and Catalan texts I'm working
>> with, I guess finding alternative characters that approximate the
>> problematic characters in the original document was not too difficult
>> for iconv.
>>
>> In response to Ciarán: if Word saves as ISO-8859-1 by default, what is
>> strange is that the 'file' command does not recognize this encoding. For
>> most of the documents I'm using that were saved from within Word, 'file'
>> reports "Non-ISO extended-ASCII text, with CRLF line terminators".
>>
>> With respect to Freeling, I'm told that they are already working on
>> making it compatible with UTF-8.
>>
>
> Really ? The last I heard from Lluís was:
>
> " No, it does not work with Unicode (basically because STL strings do
> not support Unicode yet). To process texts in UTF, what it does is
> convert them to Latin, analyze them, and convert the result back to UTF. "
>
> I would love to hear that FreeLing will be supporting UTF-8!!
>
> Fran
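
As a rough illustration of the //TRANSLIT idea discussed in the quoted
message, here is a minimal Python sketch. It assumes the Word files are
really Windows-1252 rather than ISO-8859-1, which would be consistent with
the "Non-ISO extended-ASCII" output of 'file' (bytes in the 0x80-0x9F range
are not printable characters in ISO-8859-1). The function name and the
substitution table are hypothetical; iconv has its own, much larger table
and may choose different substitutes.

```python
import unicodedata

# Hypothetical substitution table (an assumption, not iconv's own table):
# Windows-1252 characters that have no ISO-8859-1 equivalent.
APPROXIMATIONS = {
    "\u2018": "'",    # left single quotation mark
    "\u2019": "'",    # right single quotation mark
    "\u201c": '"',    # left double quotation mark
    "\u201d": '"',    # right double quotation mark
    "\u2013": "-",    # en dash
    "\u2014": "-",    # em dash
    "\u2026": "...",  # horizontal ellipsis
    "\u20ac": "EUR",  # euro sign
    "\u0153": "oe",   # small ligature oe
    "\u0152": "OE",   # capital ligature OE
}


def to_latin1_translit(raw: bytes, source_encoding: str = "cp1252") -> bytes:
    """Convert bytes to ISO-8859-1, approximating unrepresentable characters."""
    text = raw.decode(source_encoding)
    out = []
    for ch in text:
        ch = APPROXIMATIONS.get(ch, ch)
        try:
            out.append(ch.encode("latin-1"))
        except UnicodeEncodeError:
            # Fallback: decompose and keep only the ASCII base letter;
            # use "?" when nothing survives.
            stripped = unicodedata.normalize("NFKD", ch).encode("ascii", "ignore")
            out.append(stripped or b"?")
    return b"".join(out)


# Catalan sample containing Windows-1252 curly quotes, a dash and an ellipsis.
sample = b"\x93Qu\xe8 \xe9s aix\xf2?\x94 \x96 va dir l\x92home\x85"
converted = to_latin1_translit(sample)
print(converted.decode("latin-1"))
```

The accented Catalan and Spanish letters pass through unchanged because
ISO-8859-1 already has them; only the typographic characters are
approximated, which matches the experience reported above that the
substitution was unproblematic for these texts.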