[Corpora-List] Arabic encoding conversion
Jan Pomikálek
xpomikal at fi.muni.cz
Fri Oct 26 08:45:31 UTC 2007
Hi Abdusalam,
There may be two reasons for the errors:
1. the input is not valid utf8
2. the input contains some characters which cannot be converted to
cp1256 (that's indeed possible, as utf8 can encode all of Unicode
while cp1256 covers only 256 code points)
I suggest that you run "iconv" with the "-c" option. That will
drop the characters which cannot be converted rather than
giving an error and failing the conversion. (The "-s" option only
suppresses the warnings; it does not skip the characters.)
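If you want to find out which of the two reasons applies to a given file before dropping anything, a small script along these lines might help. This is only a rough sketch: the function names diagnose and convert_lossy are made up for illustration, and it assumes a file fits in memory and that Python's built-in "cp1256" codec matches what your target system expects.

```python
def diagnose(data: bytes, target: str = "cp1256"):
    """Report why `data` might fail to convert from UTF-8 to `target`.

    Returns (offset, chars): `offset` is the byte position of the first
    invalid UTF-8 sequence (or None if the input is valid UTF-8), and
    `chars` is the sorted list of distinct characters that `target`
    cannot represent.
    """
    try:
        text = data.decode("utf-8")  # strict: raises on invalid UTF-8
    except UnicodeDecodeError as err:
        return err.start, []

    unconvertible = set()
    for ch in text:
        try:
            ch.encode(target)
        except UnicodeEncodeError:
            unconvertible.add(ch)
    return None, sorted(unconvertible)


def convert_lossy(data: bytes, target: str = "cp1256") -> bytes:
    """Convert UTF-8 bytes to `target`, silently dropping anything that is
    not valid UTF-8 or has no equivalent in `target`."""
    text = data.decode("utf-8", errors="ignore")
    return text.encode(target, errors="ignore")
```

Calling diagnose(open(path, "rb").read()) on one of the failing files should tell you whether the problem is a broken byte sequence or simply characters outside cp1256; convert_lossy then gives you a conversion that skips them.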
Cheers,
Jan
on 10/26/07 8:23 AM Abdusalam F Ahmad Nwesri wrote:
> Hi,
>
> I am trying to convert the Arabic Gigaword corpus, prepared by the LDC, from UTF-8 to the Windows CP1256 encoding. The collection is purely text with XML tags.
>
> I tried "iconv" but it seems that there are errors converting some files. I am not sure what the problem is.
>
> My last resort is to write a script to read the files and convert them word by word, but before I do, I want to know whether anyone has experienced the same problem.
>
> If you are aware of another tool that I can use, please let me know.
>
> Thanks
>
> Abdusalam Nwesri
> PhD Candidate,
> School of Computer Science and IT,
> RMIT University,
> Melbourne,
> Australia.
>
> _______________________________________________
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora