Arabic-L:LING:Arabic Gigaword Third Edition

Dilworth Parkinson dilworth_parkinson at BYU.EDU
Tue Nov 20 23:06:55 UTC 2007


------------------------------------------------------------------------
Arabic-L: Tue 20 Nov 2007
Moderator: Dilworth Parkinson <dilworth_parkinson at byu.edu>
[To post messages to the list, send them to arabic-l at byu.edu]
[To unsubscribe, send message from same address you subscribed from to
listserv at byu.edu with first line reading:
            unsubscribe arabic-l                                      ]

-------------------------Directory------------------------------------

1) Subject:Arabic Gigaword Third Edition

-------------------------Messages-----------------------------------
1)
Date: 20 Nov 2007
From:ldc at ldc.upenn.edu
Subject:Arabic Gigaword Third Edition


(1) Arabic Gigaword Third Edition is a comprehensive archive of  
newswire text data acquired from Arabic news sources by the LDC at  
the University of Pennsylvania. Arabic Gigaword Third Edition  
includes all of the content of Arabic Gigaword Second Edition  
(LDC2006T02) as well as new data collected after the publication of  
that edition. Also, an archive from a new newswire source -- Assabah  
-- has been included in the third edition.

The six distinct sources of Arabic newswire represented in the third  
edition are:

Agence France Presse (afp_arb)
Assabah (asb_arb)
Al Hayat (hyt_arb)
An Nahar (nhr_arb)
Ummah Press (umh_arb)
Xinhua News Agency (xin_arb)

The seven-character codes in the parentheses above consist of the
three-character source name ID and the three-character language code
("arb") separated by an underscore ("_") character.
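
As an illustration of the naming scheme just described, the following minimal Python sketch splits each code into its source ID and language code. The names SOURCES, CODE_PATTERN, and split_code are hypothetical and chosen only for this example; they are not part of the LDC release. The pattern also assumes the IDs are lowercase ASCII letters, which holds for the six codes listed here.

```python
import re

# Source names and codes exactly as listed in the announcement.
SOURCES = {
    "afp_arb": "Agence France Presse",
    "asb_arb": "Assabah",
    "hyt_arb": "Al Hayat",
    "nhr_arb": "An Nahar",
    "umh_arb": "Ummah Press",
    "xin_arb": "Xinhua News Agency",
}

# Assumption: three lowercase letters, an underscore, three lowercase letters.
CODE_PATTERN = re.compile(r"^([a-z]{3})_([a-z]{3})$")


def split_code(code):
    """Return (source_id, language_code) for a seven-character code."""
    match = CODE_PATTERN.match(code)
    if match is None:
        raise ValueError(f"not a valid source code: {code!r}")
    return match.group(1), match.group(2)


for code, name in SOURCES.items():
    source_id, language = split_code(code)
    print(f"{name}: source ID {source_id!r}, language {language!r}")
```

Every code in the list is seven characters long and carries the language code "arb"; a string of any other shape raises ValueError.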

The epochs and document counts for the data in the third edition are  
set forth below:

Newly Added Data

Source                  Date Span            Document Count
Agence France Presse    2005.01 - 2006.12    137815
Assabah News Agency     2004.09 - 2006.12    15410           (new source)
Al Hayat News Agency    2005.01 - 2006.1     8799            (no data for 2004)
An Nahar News Agency    2005.01 - 2006.12    104950          (no data for 2004)
Xinhua News Agency      2005.01 - 2006.12    135472

This release contains 547 files, totaling approximately 1.8GB in  
compressed form (6,673 MB uncompressed) and 1,994,735 K-words.   
Arabic Gigaword Third Edition is distributed on one DVD-ROM.

2007 Subscription Members will automatically receive two copies of  
this corpus. 2007 Standard Members may request a copy as part of  
their 16 free membership corpora. Nonmembers may license this data  
for US$4000.


--------------------------------------------------------------------------
End of Arabic-L:  20 Nov 2007