Arabic-L:LING:Arabic Gigaword 4 from LDC

Dilworth Parkinson dil at BYU.EDU
Tue Dec 29 19:40:02 UTC 2009


------------------------------------------------------------------------
Arabic-L: Tue 29 Dec 2009
Moderator: Dilworth Parkinson <dilworth_parkinson at byu.edu>
[To post messages to the list, send them to arabic-l at byu.edu]
[To unsubscribe, send message from same address you subscribed from to
listserv at byu.edu with first line reading:
            unsubscribe arabic-l                                      ]

-------------------------Directory------------------------------------

1) Subject: Arabic Gigaword 4 from LDC

-------------------------Messages-----------------------------------
1)
Date: 29 Dec 2009
From: LDC
Subject: Arabic Gigaword 4 from LDC

Arabic Gigaword Fourth Edition is a comprehensive archive of Arabic newswire text that LDC has acquired over several years. It includes all of the content of Arabic Gigaword Third Edition (LDC2007T40) as well as newly collected data. In addition, three new sources have been added in the fourth edition: Al-Ahram, Asharq Al-Awsat and Al-Quds Al-Arabi.

Nine distinct international sources of Arabic newswire are represented here:

Al-Ahram (ahr_arb)
Asharq Al-Awsat (aaw_arb)
Agence France Presse (afp_arb)
Assabah (asb_arb)
Al Hayat (hyt_arb)
An Nahar (nhr_arb)
Al-Quds Al-Arabi (qds_arb)
Ummah Press (umh_arb)
Xinhua News Agency (xin_arb)

The seven-character codes shown above serve both as the names of the directories where the data files are found and as the prefix that begins every file name. Each code consists of a three-character source ID and the three-character language code ("arb"), separated by an underscore ("_").
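
The naming convention described above can be sketched in Python. This is only an illustration: the function name parse_prefix and the example file name are hypothetical, and the announcement does not say what follows the seven-character prefix in a real file name.

```python
import re

# Source IDs and names as listed in the announcement.
SOURCES = {
    "ahr": "Al-Ahram",
    "aaw": "Asharq Al-Awsat",
    "afp": "Agence France Presse",
    "asb": "Assabah",
    "hyt": "Al Hayat",
    "nhr": "An Nahar",
    "qds": "Al-Quds Al-Arabi",
    "umh": "Ummah Press",
    "xin": "Xinhua News Agency",
}

# Three-character source ID, an underscore, three-character language code.
PREFIX_RE = re.compile(r"^([a-z]{3})_([a-z]{3})")


def parse_prefix(name):
    """Split a directory or file name into (source_id, language_code).

    Hypothetical helper: only the first seven characters are examined,
    because the rest of the file name is not described in the announcement.
    """
    match = PREFIX_RE.match(name)
    if match is None:
        raise ValueError("no seven-character prefix in %r" % name)
    return match.group(1), match.group(2)


# "afp_arb_example.gz" is a made-up file name used only for illustration.
source_id, lang = parse_prefix("afp_arb_example.gz")
print(SOURCES[source_id], lang)
```

Because a directory name is the bare seven-character code, the same function works on directory names and on file names.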

These news services all use Modern Standard Arabic (MSA), so orthographic and lexical variation due to regional Arabic dialects should be fairly limited.

New in the Fourth Edition

New Sources
      This release is the first edition of Arabic Gigaword to include content from Al-Ahram, Asharq Al-Awsat and Al-Quds Al-Arabi, covering the period from November 2006 through December 2008.

New Data for Existing Sources
      This release contains all data collected by LDC from January 2007 through December 2008, except for Ummah Press, for which data from January 2005 through December 2008 are included.

The table below shows data quantity by source under the following categories: data source (Source); the number of files per source (#Files); compressed file size (Gzip-MB); uncompressed file size (Totl-MB); the number of space-separated word tokens in the text (K-wrds); and the number of documents per source (#DOCs).

Source    #Files  Gzip-MB  Totl-MB  K-wrds    #DOCs
aaw_arb       26      114      386   36694    87506
afp_arb      176      530     1979  184631   930656
ahr_arb       26      114      131   42265   107187
asb_arb       52       45      149   14322    32794
hyt_arb      166      663     2224  209318   448335
nhr_arb      157      784     2662  253559   557151
qds_arb       26       62      198   18996    49352
umh_arb       68      9.3       31    2995    11350
xin_arb       91      245      890   85689   492664
Totals       788     5018     8650  848469  2716995
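
The word counts above are described as counts of space-separated word tokens. A minimal sketch of that kind of count follows, assuming plain whitespace splitting. The function name count_tokens is hypothetical, and the announcement does not say how LDC's own tooling handles markup or punctuation.

```python
def count_tokens(text):
    """Count whitespace-separated tokens in a string.

    str.split() with no arguments splits on any run of whitespace,
    so repeated spaces and line breaks do not produce empty tokens.
    """
    return len(text.split())


# A short made-up sample; real corpus text would be read from the data files.
sample = "one two  three\nfour"
print(count_tokens(sample))
```

Splitting on whitespace alone leaves punctuation attached to neighboring words, so counts produced this way may differ from those of a tokenizer that separates punctuation.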


Arabic Gigaword Fourth Edition is distributed on one DVD-ROM.

2009 Subscription Members will automatically receive two copies of this corpus.  2009 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for US$5000.


--------------------------------------------------------------------------
End of Arabic-L:  29 Dec 2009

