[Corpora-List] Arabic encoding guesser
Francis Tyers
ftyers at prompsit.com
Tue Jul 29 14:51:33 UTC 2008
On Tue, 29-07-2008 at 10:19 -0400, David Graff wrote:
> Serge,
>
> I'd be interested in learning about any examples you've seen to the
> contrary, but for the most part, there are basically two choices for
> encoding Arabic web pages: single-byte and utf-8.
If you only need to distinguish single-byte encodings from UTF-8, the
Unix utility "file" should suffice:
$ wget -q -O - http://www.bbc.co.uk/arabic | sed 's/<.*>//g' | file -
/dev/stdin: ISO-8859 text, with very long lines, with CRLF, LF line terminators
$ wget -q -O - http://ar.wikipedia.org | sed 's/<.*>//g' | file -
/dev/stdin: UTF-8 Unicode text, with very long lines
This is pretty crude, but it seems to work on the few examples I've
tried.
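
If you would rather have a yes/no answer than parse the description
that "file" prints, a rough alternative is to check whether the bytes
are valid UTF-8 at all. The sketch below does that with iconv. The
function name guess_encoding is made up for illustration, and it
assumes an iconv that exits non-zero on invalid input. It says nothing
about which single-byte encoding is in use, and pure ASCII input is
reported as utf-8.

```shell
# Hypothetical helper (not part of the original message):
# reads bytes on stdin and prints "utf-8" or "single-byte".
guess_encoding() {
    # Converting UTF-8 to UTF-8 is a no-op for valid input;
    # iconv fails as soon as it meets an invalid byte sequence.
    if iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1; then
        echo "utf-8"
    else
        echo "single-byte"
    fi
}
```

It can be dropped into the same kind of pipeline, for example:
wget -q -O - http://ar.wikipedia.org | guess_encoding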
Fran