[Corpora-List] Character encoding headaches

lars nygaard lars.nygaard at iln.uio.no
Mon Aug 3 07:21:43 UTC 2009


Hi Josep,

You should probably use the //TRANSLIT keyword to iconv:

    iconv -f UTF8 -t ISO8859-1//TRANSLIT <

From the man page: "This means that when a character cannot be 
represented in the target character set, it can be approximated through 
one or several similarly looking characters."
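
In case it helps to see the idea spelled out, here is a rough Python sketch of the same kind of conversion. It is not what iconv does internally: the APPROXIMATIONS table and the to_latin1 name are made up for illustration, and the replacements are hand-picked guesses. The "position 0" error is most likely the BOM (U+FEFF has no ISO-8859-1 equivalent), and the "position 41" error is probably one of the Windows-1252 characters in the 0x80-0x9F range (curly quotes, dashes, euro sign), which ISO-8859-1 lacks.

```python
import unicodedata

# Hand-picked approximations for characters common in MS Word text that
# ISO-8859-1 cannot represent. This table is an illustrative assumption,
# not the table iconv uses for //TRANSLIT.
APPROXIMATIONS = {
    "\u2018": "'", "\u2019": "'",   # curly single quotes
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u2013": "-", "\u2014": "-",   # en dash, em dash
    "\u2026": "...",                # horizontal ellipsis
    "\u20ac": "EUR",                # euro sign
}


def to_latin1(raw: bytes, source_encoding: str = "utf-8-sig") -> bytes:
    """Decode raw bytes and re-encode them as ISO-8859-1.

    Characters that ISO-8859-1 cannot represent are replaced by an entry
    from APPROXIMATIONS, or by "?" when no approximation is listed.
    """
    # "utf-8-sig" drops a leading BOM if one is present; plain "utf-8"
    # would keep it as U+FEFF, which cannot be encoded in ISO-8859-1.
    text = raw.decode(source_encoding)
    # NFC turns "letter + combining accent" into one precomposed character
    # where possible, so accented letters survive the conversion.
    text = unicodedata.normalize("NFC", text)

    out = []
    for ch in text:
        try:
            ch.encode("iso-8859-1")
            out.append(ch)
        except UnicodeEncodeError:
            out.append(APPROXIMATIONS.get(ch, "?"))
    return "".join(out).encode("iso-8859-1")
```

For the files saved from Word, the same function could be called with source_encoding="cp1252" instead of the default.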

Best,
Lars Nygaard

Josep M. Fontana wrote:
> I'm sure many people in the list have experienced the curse of 
> character encoding when building corpora so perhaps somebody can help 
> me a bit in my plight.
> I am building a corpus from files with many different origins. A great 
> deal of documents are in .rtf and .doc formats and created with MS 
> Word under Windows.
>
> These documents are giving me a lot of headaches. What I did was to 
> open them with OpenOffice Writer (as I'm working with Linux) and save 
> them as text. Running the 'file' command in Linux, I see that the 
> format of the text file is "UTF-8 Unicode (with BOM) text".
>
> Since the tools I'm using (Freeling from 
> http://garraf.epsevg.upc.es/freeling/) only work well when the text is 
> ISO-8859-1, I ran 'iconv' (iconv -f UTF8 -t ISO8859-1 <  ...etc.) but 
> I get the following error message: "illegal input sequence at position 
> 0".
>
> After this, I tried a different path. I have access to some Windows 
> machines so I saved as text directly from within Word choosing the 
> Windows text format. Supposedly, this is Windows-1252 encoding and I 
> assumed this option was the safest since this is the default text 
> format for Word (at least with the installation I used). Then I ran 
> iconv -f  WINDOWS-1252 -t ISO8859-1 but I still got an error message 
> indicating that there was a different problematic character ("illegal 
> input sequence at position 41"). This is driving me insane.
>
> Does anyone have any good tips to help me out of this nightmare? If 
> anybody has had to use MS Word files, how did you go about converting 
> them to text formats NLP tools can work with?
>
> Thanks in advance.
>
> Josep M.
>
>
> _______________________________________________
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora

