[Corpora-List] Arabic encoding conversion

Abdusalam F Ahmad Nwesri a.nwesri at student.rmit.edu.au
Fri Oct 26 06:23:59 UTC 2007


Hi,

I am trying to convert the Arabic Gigaword corpus, prepared by the LDC, from UTF-8 to the Windows CP1256 encoding. The collection is plain text with XML tags.

I tried "iconv", but it seems to produce errors when converting some files. I am not sure what the problem is.
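
One possible cause is that some files contain characters with no CP1256 equivalent. A minimal sketch for checking this, assuming Python 3 and its built-in "cp1256" codec (the function name unencodable_chars is hypothetical, not part of any tool):

```python
from collections import Counter


def unencodable_chars(text, encoding="cp1256"):
    """Count the characters in `text` that `encoding` cannot represent."""
    bad = Counter()
    for ch in set(text):
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            # Record how often the offending character occurs.
            bad[ch] = text.count(ch)
    return bad
```

Running this over the decoded text of a file that fails should show whether the errors come from a handful of unsupported characters or from something else, such as invalid UTF-8 input.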

My last resort is to write a script that reads the files and converts them word by word, but before I do, I would like to know whether anyone has experienced the same problem.
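
Such a script could look roughly like the sketch below, assuming Python 3. It converts line by line rather than word by word, and the names utf8_to_cp1256 and convert_file are hypothetical. The errors argument is passed to Python's standard codec error handling: "strict" raises on the first unsupported character, while "xmlcharrefreplace" writes a numeric character reference instead, which may suit text with XML tags.

```python
def utf8_to_cp1256(data, errors="strict"):
    """Re-encode UTF-8 bytes as CP1256 bytes."""
    # Decoding is always strict so that invalid UTF-8 input is reported.
    return data.decode("utf-8").encode("cp1256", errors=errors)


def convert_file(src_path, dst_path, errors="strict"):
    """Stream a UTF-8 text file into a CP1256 file, line by line."""
    # newline="" keeps the original line endings untouched.
    with open(src_path, "r", encoding="utf-8", newline="") as src, \
         open(dst_path, "w", encoding="cp1256", errors=errors, newline="") as dst:
        for line in src:
            dst.write(line)
```

For example, convert_file("in.xml", "out.xml", errors="xmlcharrefreplace") would convert a file without stopping at unsupported characters; the file names here are placeholders.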

If you are aware of another tool that I can use, please let me know.

Thanks

Abdusalam Nwesri
PhD Candidate, 
School of Computer Science and IT,
RMIT University,
Melbourne,
Australia.

_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora


