[Corpora-List] Arabic encoding conversion

Jan Pomikálek xpomikal at fi.muni.cz
Fri Oct 26 08:45:31 UTC 2007


Hi Abdusalam,

There may be two reasons for the errors:
1. the input is not valid UTF-8
2. the input contains some characters which cannot be converted to
cp1256 (that's indeed possible, as UTF-8 can encode every Unicode
character while cp1256 is a single-byte code page with at most 256)
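
To find out which of the two cases applies to a given file, something
like the following Python sketch might help. It is only an illustration:
the names diagnose and encodable are made up here, and it assumes the
whole file fits in memory.

```python
def encodable(ch: str, target: str) -> bool:
    """True if the single character ch exists in the target encoding."""
    try:
        ch.encode(target)
        return True
    except UnicodeEncodeError:
        return False


def diagnose(data: bytes, target: str = "cp1256"):
    """Classify the raw bytes of one file as a (problem, details) pair."""
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError as err:
        # Case 1: not valid UTF-8; report the offset of the first bad byte.
        return "invalid-utf8", [err.start]
    # Case 2: collect the distinct characters missing from the target.
    missing = sorted({ch for ch in text if not encodable(ch, target)})
    if missing:
        return "unencodable", missing
    return "ok", []
```

Calling diagnose(open(path, "rb").read()) on each failing file (path
being whatever file iconv complained about) should show whether the
input is damaged or merely contains characters outside cp1256.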

I suggest that you run "iconv" with the "-c" option. That will omit
the characters which cannot be converted from the output rather than
giving an error and failing the conversion. (In GNU iconv, the "-s"
option only suppresses warnings; it does not skip such characters.)
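
If iconv is not an option, dropping the unconvertible characters can
also be approximated in a few lines of Python. This is a rough sketch
rather than a replacement for iconv; the function name convert_lossy is
made up for illustration.

```python
def convert_lossy(data: bytes, target: str = "cp1256") -> bytes:
    """Re-encode UTF-8 bytes, dropping characters the target lacks."""
    # Decoding stays strict on purpose: invalid UTF-8 still raises
    # UnicodeDecodeError, because that points to a damaged input file.
    text = data.decode("utf-8")
    # errors="ignore" silently drops characters missing from the target.
    return text.encode(target, errors="ignore")
```

Note that the dropped characters are lost without any trace, so it is
probably worth checking first which characters they are.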

Cheers,
Jan

on 10/26/07 8:23 AM Abdusalam F Ahmad Nwesri wrote:
> Hi,
> 
> I am trying to convert the Arabic Gigaword corpus, prepared by the LDC, from UTF-8 to the Windows CP1256 encoding. The collection is purely text with XML tags.
> 
> I tried "iconv", but it seems that there are errors converting some files. I am not sure what the problem is.
> 
> My final solution is to write a script to read the files and convert them word by word, but before I do, I want to know whether anyone has experienced the same problem.
> 
> If you are aware of another tool that I can use, please let me know.
> 
> Thanks
> 
> Abdusalam Nwesri
> PhD Candidate, 
> School of Computer Science and IT,
> RMIT University,
> Melbourne,
> Australia.
> 
> _______________________________________________
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
