[Corpora-List] Arabic encoding guesser

Tue Jul 29 14:51:33 UTC 2008

On Tue, 29-07-2008 at 10:19 -0400, David Graff wrote:
> Serge,
> 
> I'd be interested in learning about any examples you've seen to the 
> contrary, but for the most part, there are basically two choices for 
> encoding Arabic web pages: single-byte and utf-8.

If you only need to distinguish single-byte encodings from UTF-8, the
Unix utility "file" should suffice:

$ wget -q -O - http://www.bbc.co.uk/arabic | sed 's/<.*>//g' | file -
/dev/stdin: ISO-8859 text, with very long lines, with CRLF, LF line terminators

$ wget -q -O - http://ar.wikipedia.org | sed 's/<.*>//g' | file -
/dev/stdin: UTF-8 Unicode text, with very long lines

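This is pretty crude, but it seems to work on the few examples I've
tried. Note that the sed pattern is greedy, so it drops everything
between the first "<" and the last ">" on each line; 's/<[^>]*>//g'
would remove only the tags themselves.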
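
For anyone who would rather do the same check inside a script, here is a
minimal Python sketch of the underlying idea: text that decodes cleanly
as UTF-8 and contains non-ASCII bytes is almost certainly UTF-8, and
anything that fails to decode is treated as a single-byte encoding. This
is not how "file" works internally, the function name guess_encoding is
made up for illustration, and it does not tell you which single-byte
code page (Windows-1256, ISO-8859-6, ...) was used.

```python
def guess_encoding(data: bytes) -> str:
    """Return 'ascii', 'utf-8', or 'single-byte' for a byte string.

    Hypothetical helper: a rough two-way guess, not a full detector.
    """
    try:
        # Strict decoding raises on any byte sequence that is not valid UTF-8.
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "single-byte"
    # Pure ASCII is also valid UTF-8, so report it separately.
    return "ascii" if data.isascii() else "utf-8"


sample = "\u0639\u0631\u0628\u064a"  # a short Arabic word, written with escapes
print(guess_encoding(sample.encode("utf-8")))   # utf-8
print(guess_encoding(sample.encode("cp1256")))  # single-byte
```

The guess relies on the fact that real single-byte Arabic text very
rarely happens to form valid UTF-8 sequences; a short or unusual input
could still be misclassified.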

Fran
