Arabic-L:LING:Arabic Gigaword 4 from LDC
Dilworth Parkinson
dil at BYU.EDU
Tue Dec 29 19:40:02 UTC 2009
------------------------------------------------------------------------
Arabic-L: Tue 29 Dec 2009
Moderator: Dilworth Parkinson <dilworth_parkinson at byu.edu>
[To post messages to the list, send them to arabic-l at byu.edu]
[To unsubscribe, send message from same address you subscribed from to
listserv at byu.edu with first line reading:
unsubscribe arabic-l ]
-------------------------Directory------------------------------------
1) Subject:Arabic Gigaword 4 from LDC
-------------------------Messages-----------------------------------
1)
Date: 29 Dec 2009
From:LDC
Subject:Arabic Gigaword 4 from LDC
Arabic Gigaword Fourth Edition is a comprehensive archive of Arabic newswire text that has been acquired over several years at LDC. Arabic Gigaword Fourth Edition includes all of the content of Arabic Gigaword Third Edition (LDC2007T40) as well as newly collected data. In addition, three new sources have been added in the fourth edition: Al-Ahram, Asharq Al-Awsat and Al-Quds Al-Arabi.
Nine distinct international sources of Arabic newswire are represented here:
Al-Ahram (ahr_arb)
Asharq Al-Awsat (aaw_arb)
Agence France Presse (afp_arb)
Assabah (asb_arb)
Al Hayat (hyt_arb)
An Nahar (nhr_arb)
Al-Quds Al-Arabi (qds_arb)
Ummah Press (umh_arb)
Xinhua News Agency (xin_arb)
The seven-character codes shown above serve both as the names of the directories where the data files are found and as the seven-character prefix at the beginning of every file name. Each code consists of a three-character source ID and the three-character language code ("arb"), separated by an underscore ("_").
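The naming scheme described above can be sketched in Python. This is only an illustrative sketch: the function names are hypothetical, the sample file name in the usage note is invented, and the announcement does not say what follows the prefix in a real file name.

```python
import re

# Pattern assumed from the description: three letters, an underscore,
# then three letters (the language code, "arb" for every source here).
CODE_RE = re.compile(r"([a-z]{3})_([a-z]{3})")


def split_source_code(code):
    """Split a code such as 'afp_arb' into (source_id, language_code)."""
    match = CODE_RE.fullmatch(code)
    if match is None:
        raise ValueError(f"not a seven-character source code: {code!r}")
    return match.group(1), match.group(2)


def source_code_of(file_name):
    """Return the seven-character prefix of a file name, after checking it."""
    prefix = file_name[:7]
    split_source_code(prefix)  # raises ValueError if the prefix is malformed
    return prefix
```

For example, split_source_code("afp_arb") yields the pair ("afp", "arb"), and source_code_of applied to a made-up name such as "xin_arb_sample.gz" yields "xin_arb", which is also the directory that would hold the file.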
These news services all use Modern Standard Arabic (MSA), so there should be a fairly limited scope for orthographic and lexical variation due to regional Arabic dialects.
New in the Fourth Edition
New Sources
This release is the first edition of Arabic Gigaword to include content from Al-Ahram, Asharq Al-Awsat and Al-Quds Al-Arabi, covering the period from November 2006 through December 2008.
New Data for Existing Sources
This release contains all data collected by LDC from January 2007 through December 2008, except for Ummah Press, for which data from January 2005 through December 2008 are included.
The table below shows data quantity by source under the following categories: data source (Source); the number of files per source (#Files); compressed file size (Gzip-MB); uncompressed file size (Totl-MB); the number of space-separated word tokens in the text (K-words); and the number of documents per source (#DOCs).
Source    #Files  Gzip-MB  Totl-MB  K-words    #DOCs
aaw_arb       26      114      386    36694    87506
afp_arb      176      530     1979   184631   930656
ahr_arb       26      114      131    42265   107187
asb_arb       52       45      149    14322    32794
hyt_arb      166      663     2224   209318   448335
nhr_arb      157      784     2662   253559   557151
qds_arb       26       62      198    18996    49352
umh_arb       68      9.3       31     2995    11350
xin_arb       91      245      890    85689   492664
Totals       788     5018     8650   848469  2716995
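As a rough illustration of how the figures above can be used, the following Python sketch estimates the average document length per source. It assumes that K-words counts thousands of word tokens, which the announcement does not state explicitly; the function name is hypothetical, and the numbers are copied from the K-words and #DOCs columns.

```python
# (source, K-words, #DOCs), copied from the table in this announcement.
ROWS = [
    ("aaw_arb", 36694, 87506),
    ("afp_arb", 184631, 930656),
    ("ahr_arb", 42265, 107187),
    ("asb_arb", 14322, 32794),
    ("hyt_arb", 209318, 448335),
    ("nhr_arb", 253559, 557151),
    ("qds_arb", 18996, 49352),
    ("umh_arb", 2995, 11350),
    ("xin_arb", 85689, 492664),
]


def words_per_doc(k_words, docs):
    """Average word tokens per document, assuming K-words is in thousands."""
    return k_words * 1000 / docs


for source, k_words, docs in ROWS:
    print(f"{source}  {words_per_doc(k_words, docs):7.1f}")
```

Under that assumption the agency wires (afp_arb, xin_arb) come out with shorter average documents than the newspaper sources.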
Arabic Gigaword Fourth Edition is distributed on one DVD-ROM.
2009 Subscription Members will automatically receive two copies of this corpus. 2009 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US$5000.
--------------------------------------------------------------------------
End of Arabic-L: 29 Dec 2009