[Corpora-List] Arabic encoding conversion
David Graff
graff at ldc.upenn.edu
Fri Oct 26 20:04:38 UTC 2007
As explained in the readme file that comes with the Gigaword Arabic
corpus, Al Hayat was originally delivered to the LDC in the form of
MacArabic-encoded text data. This was the only news source in the
corpus that did not originally use either CP-1256 encoding or ASMO 499
encoding (which uses a fully compatible subset of the Arabic characters
in CP-1256). The readme file is available on the web:
http://www.ldc.upenn.edu/Catalog/docs/LDC2006T02/0readme.txt
As it turns out, virtually all of the Al Hayat data files in the corpus
contain numerous instances of one or more of the following Unicode code
points, which are not mappable to CP-1256:
U+066A (Arabic percent sign)
U+06A4 (Arabic letter "veh")
U+274A ("eight teardrop-spoked propeller asterisk")
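One way to confirm which code points lack a CP-1256 mapping is to ask Perl's
Encode module directly. The following is only a minimal sketch, assuming the
Encode module bundled with Perl 5.8 and later; the helper name
maps_to_cp1256 is hypothetical and is not part of the corpus tools.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode FB_CROAK LEAVE_SRC);

# Hypothetical helper: return 1 if the single character $ch can be
# encoded in CP-1256, 0 otherwise. FB_CROAK makes encode() die on an
# unmappable character; LEAVE_SRC keeps encode() from modifying $ch.
sub maps_to_cp1256 {
    my ($ch) = @_;
    return eval { encode("cp1256", $ch, FB_CROAK | LEAVE_SRC); 1 } ? 1 : 0;
}

# Report on the three code points listed above, plus U+0641 (feh)
# for comparison.
for my $cp (0x066A, 0x06A4, 0x274A, 0x0641) {
    printf "U+%04X %s\n", $cp,
        maps_to_cp1256(chr $cp) ? "mappable" : "not mappable";
}
```

The same helper could be applied to every distinct character found in a data
file to obtain a full list of the characters that need attention.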
Also, contrary to what was stated in the readme, the Al Hayat data files
from 2002 and 2003 contain many instances of the "Arabic-Indic digits"
(U+0660 - U+0669), which will cause errors when converting to CP-1256.
These should have been converted to the common ASCII digits, and this
corpus bug will be fixed in the Release 3 version of Gigaword Arabic,
which will come out later this year.
Finally, there are just a few Al Hayat files (200201, 200202, 200302,
200310, 200311) containing one or two accented Latin1 characters (U+00C9
"capital letter e with acute", U+00E4 "small letter a with diaeresis",
U+00F6 "small letter o with diaeresis"), because a given news story
contained a few French or German words embedded in the Arabic text.
In order to convert the Al Hayat data to CP-1256, the various code
points mentioned above would need to be removed, or replaced with
suitable substitute characters that are mappable to CP-1256. For most
of the characters involved, the selection of a replacement is simple.
In the case of the "Arabic letter veh" (U+06A4), I checked with Tim
Buckwalter, who looked at how this letter was being used in context,
and he suggested that a suitable replacement would be U+0641 (Arabic
letter feh).
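The replacement choices described above can be written out as an explicit
lookup table. This is a sketch for illustration only: the names %replacement
and substitute_for_cp1256 are hypothetical, and the ASCII substitutes for the
percent sign, the asterisk, and the accented Latin letters are simply the
obvious plain-ASCII counterparts.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical table: each unmappable code point and its substitute.
my %replacement = (
    "\x{066A}" => "%",          # Arabic percent sign -> ASCII percent
    "\x{06A4}" => "\x{0641}",   # Arabic letter veh   -> Arabic letter feh
    "\x{274A}" => "*",          # propeller asterisk  -> ASCII asterisk
    "\x{00C9}" => "E",          # E with acute        -> E
    "\x{00E4}" => "a",          # a with diaeresis    -> a
    "\x{00F6}" => "o",          # o with diaeresis    -> o
);

# Arabic-Indic digits U+0660 .. U+0669 -> ASCII digits 0 .. 9.
$replacement{ chr(0x0660 + $_) } = "$_" for 0 .. 9;

# Return a copy of $text with every listed code point replaced.
sub substitute_for_cp1256 {
    my ($text) = @_;
    $text =~ s/([\x{0660}-\x{066A}\x{06A4}\x{274A}\x{00C9}\x{00E4}\x{00F6}])/$replacement{$1}/g;
    return $text;
}
```

A table like this is easier to extend than a single tr/// expression if
further unmappable characters turn up, at some cost in speed.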
The Perl script provided below creates a CP-1256 plain-text (uncompressed)
file for each of the original UTF-8 files from the corpus, so there is no
need to use iconv in this case. It assumes that you have a recent version
of Perl (v5.8.1 or higher), that the Al Hayat data files are in a
directory where you have read/write access, and that you are using a
command-line shell environment where the GNU "gunzip" tool is in your
execution path.
David Graff
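#!/usr/bin/perl
use strict;
use warnings;

# Convert each gzipped UTF-8 Al Hayat file to an uncompressed CP-1256 text file.
foreach my $file (<hyt_arb_*.gz>) {
    # Read the decompressed data through a pipe, decoding it as UTF-8.
    open(my $in, "-|:utf8", "gunzip -c $file") or die "$file: $!";
    (my $out_name = $file) =~ s/gz$/txt/;
    open(my $out, ">:encoding(cp1256)", $out_name) or die "$out_name: $!";
    while (<$in>) {
        # Arabic-Indic digits -> 0-9, Arabic percent -> %, veh -> feh,
        # propeller asterisk -> *, accented Latin letters -> plain E, a, o.
        tr/\x{0660}-\x{066A}\x{06A4}\x{274A}\x{00C9}\x{00E4}\x{00F6}/0-9%\x{0641}*Eao/;
        print $out $_;
    }
    close $in;
    close $out;
}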
__END__