[Corpora-List] Character encoding headaches

Sun Aug 2 22:12:23 UTC 2009

I'm sure many people on the list have experienced the curse of character 
encoding when building corpora, so perhaps somebody can help me a bit in 
my plight.
I am building a corpus from files with many different origins. Many of 
the documents are in .rtf and .doc formats and were created with MS Word 
under Windows.

These documents are giving me a lot of headaches. I opened them with 
OpenOffice Writer (as I'm working under Linux) and saved them as text. 
Running the 'file' command in Linux, I see that the format of the 
text file is "UTF-8 Unicode (with BOM) text".

Since the tools I'm using (Freeling from 
http://garraf.epsevg.upc.es/freeling/) only work well when the text is 
ISO-8859-1, I ran 'iconv' (iconv -f UTF8 -t ISO8859-1 <  ...etc.) but I 
got the following error message: "illegal input sequence at position 0".
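
A possible explanation, offered as an assumption rather than a confirmed 
diagnosis: the byte order mark (U+FEFF) at the start of the file has no 
equivalent in ISO-8859-1, which would account for a failure at position 0. 
The minimal Python sketch below drops a leading BOM while decoding and then 
re-encodes as ISO-8859-1. The function name and the sample bytes are 
hypothetical, and any character that ISO-8859-1 cannot represent is replaced 
with "?" instead of aborting the conversion.

```python
import codecs


def utf8_bom_to_latin1(data: bytes) -> bytes:
    """Decode UTF-8 (with or without a BOM) and re-encode as ISO-8859-1."""
    # The 'utf-8-sig' codec silently skips a leading BOM if one is present.
    text = data.decode("utf-8-sig")
    # errors="replace" substitutes '?' for characters outside ISO-8859-1.
    return text.encode("iso-8859-1", errors="replace")


# Hypothetical sample: a BOM followed by UTF-8 encoded Spanish text.
sample = codecs.BOM_UTF8 + "señor".encode("utf-8")
converted = utf8_bom_to_latin1(sample)
```

Using errors="strict" instead of "replace" would raise an exception at the 
first character that cannot be represented, which is closer to what iconv 
does by default.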

After this, I tried a different path. I have access to some Windows 
machines, so I saved the documents as text directly from within Word, 
choosing the Windows text format. Supposedly, this is Windows-1252 
encoding, and I assumed this option was the safest since it is the default 
text format for Word (at least with the installation I used). Then I ran 
iconv -f WINDOWS-1252 -t ISO8859-1, but I still got an error message 
indicating that there was a different problematic character ("illegal 
input sequence at position 41"). This is driving me insane.
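
One likely cause, stated here as an assumption: Windows-1252 assigns 
typographic characters such as curly quotes, dashes and the ellipsis to 
bytes 0x80-0x9F, and these have no counterpart in ISO-8859-1. The Python 
sketch below lists the offending characters with their positions and then 
converts the text after mapping common typographic punctuation to plain 
ASCII. The replacement table, the function names and the sample bytes are 
all hypothetical and would need extending for real data.

```python
# Hypothetical mapping from typographic characters to ASCII stand-ins.
REPLACEMENTS = {
    "\u2018": "'",    # left single quotation mark
    "\u2019": "'",    # right single quotation mark
    "\u201c": '"',    # left double quotation mark
    "\u201d": '"',    # right double quotation mark
    "\u2013": "-",    # en dash
    "\u2014": "-",    # em dash
    "\u2026": "...",  # horizontal ellipsis
}


def find_unencodable(text: str, encoding: str = "iso-8859-1"):
    """Return (position, character) pairs that the encoding cannot represent."""
    problems = []
    for position, char in enumerate(text):
        try:
            char.encode(encoding)
        except UnicodeEncodeError:
            problems.append((position, char))
    return problems


def cp1252_to_latin1(data: bytes) -> bytes:
    """Decode Windows-1252, map typographic punctuation, encode as ISO-8859-1."""
    text = data.decode("cp1252")
    text = text.translate(str.maketrans(REPLACEMENTS))
    # Anything still unrepresentable becomes '?' rather than raising an error.
    return text.encode("iso-8859-1", errors="replace")


# Hypothetical sample: curly quotes, an en dash and an ellipsis in cp1252.
sample = b"\x93Hola\x94 \x96 adi\xf3s\x85"
problems = find_unencodable(sample.decode("cp1252"))
converted = cp1252_to_latin1(sample)
```

Printing the list returned by find_unencodable before converting shows 
exactly which characters would be altered, so the mapping can be reviewed 
rather than applied blindly.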

Does anyone have any good tips to help me out of this nightmare? If 
anybody has had to use MS Word files, how did you go about converting 
them to text formats NLP tools can work with?

Thanks in advance.

Josep M.
