[Corpora-List] Arabic encoding guesser

David Graff graff at ldc.upenn.edu
Tue Jul 29 14:19:05 UTC 2008


Serge,

I'd be interested in learning about any examples you've seen to the
contrary, but for the most part there are two choices for encoding
Arabic web pages: single-byte and UTF-8.

(There is of course a third choice, involving image data or pdf files,
which makes the character data quite difficult to get at.)

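For single-byte encoding, ISO-8859-6 and cp1256 cover essentially the
same Arabic characters (and MacArabic, which is unlikely to be found in
use on the web, is only slightly different).  In terms of repertoire,
cp1256 is a superset of ISO-8859-6, so treating all single-byte data as
if it were cp1256 should be fine in most cases.  Note, though, that the
two assign some letters to different byte values, so data that really is
ISO-8859-6 will not decode cleanly as cp1256.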
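
How far the two single-byte encodings agree can be checked directly by
decoding every high byte under both and comparing the results.  A
minimal sketch in Python, assuming the standard-library codec names
"iso8859_6" and "cp1256"; the helper name compare_single_byte is made
up for illustration:

```python
def compare_single_byte(enc_a="iso8859_6", enc_b="cp1256"):
    """Return (same, different) lists of byte values in 0x80-0xFF.

    A byte counts as "same" when both codecs decode it to the same
    character; bytes undefined in either codec count as "different".
    """
    same, different = [], []
    for value in range(0x80, 0x100):
        raw = bytes([value])
        try:
            a = raw.decode(enc_a)
            b = raw.decode(enc_b)
        except UnicodeDecodeError:
            # At least one codec has no mapping for this byte.
            different.append(value)
            continue
        (same if a == b else different).append(value)
    return same, different


same, different = compare_single_byte()
print(f"{len(same)} high bytes agree, {len(different)} differ")
```

The "different" list shows which byte values would be read differently
under the two codecs.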

In effect, if you are already confident that the web-page content is in
Arabic (prior to processing the text), the question reduces to detecting
UTF-8 encoding, which is quite simple and can be done in a variety of
ways.

Perhaps the easiest way is to push the data through some process that
expects UTF-8 data and reports an error when the input is not
well-formed UTF-8.  When it reports an error, treat the data as cp1256.
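
The try-then-fall-back approach just described might look like this in
Python.  This is only a sketch: the function name decode_arabic_page is
hypothetical, and the choice of errors="replace" for the cp1256 branch
is an assumption rather than part of the recipe above.

```python
from typing import Tuple


def decode_arabic_page(raw: bytes) -> Tuple[str, str]:
    """Decode raw page bytes, returning (text, encoding_used)."""
    try:
        # A strict decode raises UnicodeDecodeError on any byte
        # sequence that is not well-formed UTF-8.
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Not UTF-8, so treat the data as single-byte cp1256.
        return raw.decode("cp1256", errors="replace"), "cp1256"


sample = "مرحبا"
print(decode_arabic_page(sample.encode("utf-8"))[1])
print(decode_arabic_page(sample.encode("cp1256"))[1])
```

Pure ASCII input is also well-formed UTF-8, so it takes the first
branch; that is harmless, since ASCII decodes identically either way.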

(Determining whether web page content is in Arabic is a separate
question, and for that you should still check for UTF-8 encoding
first, because if the data is UTF-8, language detection is a
relatively simple task.)
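
As a rough illustration of why decoded text makes this easier: once the
data is a Unicode string, one crude check is the share of letters that
fall in the basic Arabic block (U+0600 to U+06FF).  The function name
arabic_letter_ratio is made up for this sketch, and any cutoff applied
to its result would be an assumption to tune on real data.  Note that
this measures script, not language, so Persian or Urdu text would score
high as well.

```python
def arabic_letter_ratio(text: str) -> float:
    """Fraction of alphabetic characters in the block U+0600-U+06FF."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    arabic = sum(1 for ch in letters if "\u0600" <= ch <= "\u06FF")
    return arabic / len(letters)


print(arabic_letter_ratio("مرحبا بالعالم"))
print(arabic_letter_ratio("hello world"))
```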

   Best regards,
	David Graff





