[Corpora-List] Character encoding headaches
lars nygaard
lars.nygaard at iln.uio.no
Mon Aug 3 07:21:43 UTC 2009
Hi Josep,
You should probably use the //TRANSLIT keyword to iconv:
iconv -f UTF8 -t ISO8859-1//TRANSLIT <
From the man page: "This means that when a character cannot be
represented in the target character set, it can be approximated through
one or several similarly looking characters."
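A minimal sketch of how this could be wrapped up, not a tested recipe for your files. It assumes a glibc-style iconv that accepts the //TRANSLIT suffix, and it assumes that the byte order mark reported by 'file' is worth stripping before conversion, since a BOM has no ISO-8859-1 equivalent. The function name to_latin1 is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper (the name is an assumption, not part of iconv):
# read UTF-8 text on stdin, write ISO-8859-1 text on stdout.
to_latin1() {
    # The three bytes of a UTF-8 byte order mark (EF BB BF), built with
    # POSIX printf octal escapes so no sed-specific escape syntax is needed.
    bom=$(printf '\357\273\277')

    # Remove the BOM if it opens the first line. LC_ALL=C makes sed treat
    # the input as plain bytes. Then let iconv approximate any character
    # that ISO-8859-1 cannot represent instead of stopping with an error.
    LC_ALL=C sed "1s/^${bom}//" |
        iconv -f UTF-8 -t ISO-8859-1//TRANSLIT
}
```

It would be called as "to_latin1 < input.txt > output.txt", where both file names are placeholders. How a given character is approximated depends on the iconv implementation and the current locale, so it is worth checking a sample of the output by eye.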
Best,
Lars Nygaard
Josep M. Fontana wrote:
> I'm sure many people in the list have experienced the curse of
> character encoding when building corpora so perhaps somebody can help
> me a bit in my plight.
> I am building a corpus from files with many different origins. A great
> deal of documents are in .rtf and .doc formats and created with MS
> Word under Windows.
>
> These documents are giving me a lot of headaches. What I did was to
> open them with OpenOffice Writer (as I'm working with Linux) and save
> them as text. Running the 'file' command in Linux, I see that the
> format of the text file is "UTF-8 Unicode (with BOM) text".
>
> Since the tools I'm using (Freeling from
> http://garraf.epsevg.upc.es/freeling/) only work well when the text is
> ISO-8859-1, I ran 'iconv' (iconv -f UTF8 -t ISO8859-1 < ...etc.) but
> I get the following error message: "illegal input sequence at position
> 0".
>
> After this, I tried a different path. I have access to some Windows
> machines so I saved as text directly from within Word choosing the
> Windows text format. Supposedly, this is Windows-1252 encoding and I
> assumed this option was the safest since this is the default text
> format for Word (at least with the installation I used). Then I ran
> iconv -f WINDOWS-1252 -t ISO8859-1 but I still got an error message
> indicating that there was a different problematic character ("illegal
> input sequence at position 41"). This is driving me insane.
>
> Does anyone have any good tips to help me out of this nightmare? If
> anybody has had to use MS Word files, how did you go about converting
> them to text formats NLP tools can work with?
>
> Thanks in advance.
>
> Josep M.
>
>
> _______________________________________________
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora