[Corpora-List] Character encoding headaches
Josep M. Fontana
josepm.fontana at upf.edu
Sun Aug 2 22:12:23 UTC 2009
I'm sure many people on the list have experienced the curse of character
encoding when building corpora, so perhaps somebody can help me out of
my plight.
I am building a corpus from files with many different origins. Many of
the documents are in .rtf and .doc formats and were created with MS Word
under Windows.
These documents are giving me a lot of headaches. What I did was to open
them with OpenOffice Writer (as I'm working with Linux) and save them as
text. Running the 'file' command in Linux, I see that the format of the
resulting text file is "UTF-8 Unicode (with BOM) text".
Since the tools I'm using (Freeling from
http://garraf.epsevg.upc.es/freeling/) only work well when the text is
ISO-8859-1, I ran 'iconv' (iconv -f UTF8 -t ISO8859-1 < ...etc.) but I
get the following error message: "illegal input sequence at position 0".
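
One possible reading of the "position 0" error, offered as an assumption
rather than a confirmed diagnosis: the byte order mark (BOM) that 'file'
reports is the character U+FEFF, which has no counterpart in ISO-8859-1,
so the conversion may fail on the very first character. The minimal
Python sketch below illustrates that idea. The function name
utf8_bom_to_latin1 is hypothetical, and replacing unconvertible
characters with "?" is just one choice among several.

```python
import codecs


def utf8_bom_to_latin1(data: bytes) -> bytes:
    """Decode UTF-8, dropping a leading BOM if present, then encode as ISO-8859-1."""
    # The 'utf-8-sig' codec strips a leading BOM; plain 'utf-8' would keep it
    # in the text as the character U+FEFF.
    text = data.decode("utf-8-sig")
    # Characters outside ISO-8859-1 become "?" instead of raising an error
    # (an assumption about what is acceptable for the corpus).
    return text.encode("iso-8859-1", errors="replace")


def bom_blocks_conversion(data: bytes) -> bool:
    """Return True if a kept BOM makes the ISO-8859-1 encoding fail at index 0."""
    text = data.decode("utf-8")  # the BOM survives as U+FEFF
    try:
        text.encode("iso-8859-1")
    except UnicodeEncodeError as err:
        return err.start == 0
    return False


if __name__ == "__main__":
    sample = codecs.BOM_UTF8 + "caña".encode("utf-8")
    print(bom_blocks_conversion(sample))
    print(utf8_bom_to_latin1(sample))
```

If the BOM is indeed the culprit, removing it before conversion should
get past position 0, although later characters may still fail.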
After this, I tried a different path. I have access to some Windows
machines so I saved as text directly from within Word choosing the
Windows text format. Supposedly, this is Windows-1252 encoding and I
assumed this option was the safest since this is the default text format
for Word (at least with the installation I used). Then I ran iconv -f
WINDOWS-1252 -t ISO8859-1 but I still got an error message indicating
that there was a different problematic character ("illegal input
sequence at position 41"). This is driving me insane.
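
A plausible cause of this second error, stated as an assumption:
Windows-1252 assigns printable characters such as curly quotes, dashes
and the euro sign to the byte range 0x80-0x9F, and these characters do
not exist in ISO-8859-1. Word tends to insert curly quotes and dashes
automatically, so one of them could well sit at position 41. The Python
sketch below locates such characters and substitutes plain equivalents
before encoding. The FALLBACKS table and both function names are
hypothetical, and the substitutions are illustrative rather than
complete.

```python
# Illustrative substitutions for common Word typography (not exhaustive).
FALLBACKS = {
    "\u2018": "'", "\u2019": "'",   # single curly quotes
    "\u201c": '"', "\u201d": '"',   # double curly quotes
    "\u2013": "-", "\u2014": "-",   # en dash, em dash
    "\u2026": "...",                # ellipsis
    "\u20ac": "EUR",                # euro sign
}


def unconvertible(text: str) -> list:
    """Return (index, character) pairs that ISO-8859-1 cannot represent."""
    # ISO-8859-1 covers exactly the code points 0 through 255.
    return [(i, ch) for i, ch in enumerate(text) if ord(ch) > 0xFF]


def cp1252_to_latin1(data: bytes) -> bytes:
    """Decode Windows-1252, apply FALLBACKS, then encode as ISO-8859-1."""
    # Python's cp1252 codec raises UnicodeDecodeError on the few bytes that
    # Windows-1252 leaves undefined (for example 0x81).
    text = data.decode("cp1252")
    for src, dst in FALLBACKS.items():
        text = text.replace(src, dst)
    # Anything still unrepresentable becomes "?".
    return text.encode("iso-8859-1", errors="replace")


if __name__ == "__main__":
    raw = b"\x93hola\x94 \x96 a\xf1o"
    print(unconvertible(raw.decode("cp1252")))
    print(cp1252_to_latin1(raw))
```

Listing the offending characters first makes it easier to decide, per
corpus, whether to substitute them or drop them.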
Does anyone have any good tips to help me out of this nightmare? If
anybody has had to use MS Word files, how did you go about converting
them to text formats NLP tools can work with?
Thanks in advance.
Josep M.
_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora