[Corpora-List] Arabic encoding conversion
Mike Maxwell
maxwell at umiacs.umd.edu
Fri Oct 26 13:32:01 UTC 2007
Abdusalam F Ahmad Nwesri wrote:
>> I am trying to convert the Arabic Giga word corpus, prepared by the
>> LDC, from the UTF8 format to windows CP1256 encoding. The
>> collection is purely text with xml tags.
>>
>> I tried "iconv" but it seems that there are errors converting some
>> files. I am not sure what is the problem.
If the UTF-8 text contains characters that can't be converted to CP1256, one
way to track them down would be to list all the Unicode code points that
occur, together with their counts. My suspicion is that the non-converting
characters will be fairly rare, and this would give you a way to find
out what they are (e.g. using the grepp utility,
http://www.perlmonks.org/?node_id=345275), and perhaps process them some
other way. They might for example be funny quote marks, or accented
characters in loan words written in their original script. And of
course they might simply be errors.
I know of a program that does this kind of character count, but it
doesn't seem to be freely available (at least I can't find it on the
web). Of course it wouldn't be hard to roll your own.
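For what it's worth, here is a minimal sketch of such a roll-your-own counter
in Python. It is not the program mentioned above, just an illustration, and
it makes some assumptions: the files are valid UTF-8, Python's built-in
"cp1256" codec is a fair stand-in for what iconv targets, and the function
names (count_code_points, unconvertible, report) are made up for this
example. It tallies every code point and then reports only those that fail
to encode as CP1256.

```python
import sys
import unicodedata
from collections import Counter


def count_code_points(text):
    """Return a Counter mapping each character in text to its frequency."""
    return Counter(text)


def unconvertible(counts, encoding="cp1256"):
    """Keep only the characters that cannot be encoded in `encoding`."""
    bad = {}
    for ch, n in counts.items():
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            bad[ch] = n
    return bad


def report(counts):
    """Format rows as 'U+XXXX<TAB>count<TAB>name', most frequent first."""
    rows = []
    for ch, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        name = unicodedata.name(ch, "<unnamed>")
        rows.append(f"U+{ord(ch):04X}\t{n}\t{name}")
    return rows


if __name__ == "__main__":
    # Usage (hypothetical script name): python count_chars.py file1 file2 ...
    totals = Counter()
    for path in sys.argv[1:]:
        # errors="replace" keeps going on malformed UTF-8; such bytes
        # show up in the report as U+FFFD.
        with open(path, encoding="utf-8", errors="replace") as f:
            for line in f:
                totals.update(line)
    for row in report(unconvertible(totals)):
        print(row)
```

Reading line by line keeps memory use modest on large corpus files. A
U+FFFD entry in the output would suggest that some input was not valid
UTF-8 in the first place, which is a different problem from a character
missing in CP1256.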
--
Mike Maxwell
maxwell at umiacs.umd.edu
"Theorists...have merely to lock themselves in a room
with a blackboard and coffee maker to conduct their business."
--Bruce A. Schumm, Deep Down Things
_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora