[Corpora-List] Arabic encoding conversion

Mike Maxwell maxwell at umiacs.umd.edu
Fri Oct 26 13:32:01 UTC 2007


Abdusalam F Ahmad Nwesri wrote:
>> I am trying to convert the Arabic Giga word corpus, prepared by the
>> LDC, from the UTF8 format to windows CP1256 encoding. The
>> collection is purely text with xml tags.
>> 
>> I tried "iconv" but it seems that there are errors converting some
>> files. I am not sure what is the problem.

If the Unicode text contains characters that can't be converted to CP1256,
one way to track them down would be to list every Unicode code point that
occurs, together with its count.  My suspicion is that the non-converting
characters will be fairly rare, and such a list would give you a way to
find out what they are, locate them in the files (e.g. using the grepp
utility http://www.perlmonks.org/?node_id=345275), and perhaps process them
some other way.  They might for example be funny quote marks, or accented
characters in loan words written in their original script.  And of
course they might simply be errors.

I know of a program that does this kind of character count, but it 
doesn't seem to be freely available (at least I can't find it on the 
web).  Of course it wouldn't be hard to roll your own.
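
For what it's worth, here is a minimal sketch of such a counter in Python.
It is not the program mentioned above, just one way it might be done.  The
function names are my own invention, and I am assuming that Python's
built-in "cp1256" codec matches closely enough what iconv does for CP1256.
It counts every character in the files given on the command line and prints
only those that will not encode, rarest first.

```python
import sys
import unicodedata
from collections import Counter


def count_code_points(text):
    """Return a Counter mapping each character in text to its frequency."""
    return Counter(text)


def is_encodable(ch, encoding="cp1256"):
    """True if the single character ch can be represented in encoding."""
    try:
        ch.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


def unconvertible(counts, encoding="cp1256"):
    """Return (char, count) pairs that cannot be encoded, rarest first."""
    bad = [(ch, n) for ch, n in counts.items() if not is_encodable(ch, encoding)]
    return sorted(bad, key=lambda pair: (pair[1], pair[0]))


def report(paths, encoding="cp1256"):
    """Print code point, count and Unicode name of each unconvertible char."""
    counts = Counter()
    for path in paths:
        # errors="replace" turns malformed UTF-8 into U+FFFD, which is not
        # in CP1256 either, so broken byte sequences show up in the report.
        with open(path, encoding="utf-8", errors="replace") as handle:
            for line in handle:
                counts.update(line)
    for ch, n in unconvertible(counts, encoding):
        name = unicodedata.name(ch, "<unnamed>")
        print(f"U+{ord(ch):04X}\t{n}\t{name}")


if __name__ == "__main__":
    report(sys.argv[1:])
```

Saved under some name of your choosing and run with the corpus files as
arguments, it prints one tab-separated line per offending code point.  The
whole corpus is tallied in one Counter, which should be fine since the
number of distinct characters is small even if the files are large.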
-- 
	Mike Maxwell
	maxwell at umiacs.umd.edu
	"Theorists...have merely to lock themselves in a room
	with a blackboard and coffee maker to conduct their business."
	--Bruce A. Schumm, Deep Down Things



