Arabic-L:LING:Arabic Gigaword Third Edition

Dilworth Parkinson dilworth_parkinson at BYU.EDU
Fri Nov 30 18:21:07 UTC 2007


------------------------------------------------------------------------
Arabic-L: Fri 30 Nov 2007
Moderator: Dilworth Parkinson <dilworth_parkinson at byu.edu>
[To post messages to the list, send them to arabic-l at byu.edu]
[To unsubscribe, send message from same address you subscribed from to
listserv at byu.edu with first line reading:
            unsubscribe arabic-l                                      ]

-------------------------Directory------------------------------------

1) Subject:Arabic Gigaword Third Edition

-------------------------Messages-----------------------------------
1)
Date: 30 Nov 2007
From:ldc at ldc.upenn.edu
Subject:Arabic Gigaword Third Edition

Arabic Gigaword Third Edition is a comprehensive archive of newswire  
text data acquired from Arabic news sources by the LDC at the  
University of Pennsylvania. Arabic Gigaword Third Edition includes all  
of the content of Arabic Gigaword Second Edition (LDC2006T02) as well  
as new data collected after the publication of that edition. Also, an  
archive from a new newswire source -- Assabah -- has been included in  
the third edition.

The six distinct sources of Arabic newswire represented in the third  
edition are:

	• Agence France Presse (afp_arb)
	• Assabah (asb_arb)
	• Al Hayat (hyt_arb)
	• An Nahar (nhr_arb)
	• Ummah Press (umh_arb)
	• Xinhua News Agency (xin_arb)

The seven-character codes in the parentheses above consist of the
three-character source name ID and the three-character language code
("arb"), separated by an underscore ("_") character.
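
As a rough illustration of the naming scheme just described, the sketch below splits a code into its two parts. The function name split_source_code and its validation rules are assumptions made for this example; they are not part of the LDC release or its tooling.

```python
def split_source_code(code: str) -> tuple[str, str]:
    """Split a code such as 'afp_arb' into (source_id, language_code).

    Hypothetical helper: it assumes the layout described in the
    announcement (three characters, an underscore, three characters).
    """
    source_id, sep, language = code.partition("_")
    if not sep or len(source_id) != 3 or len(language) != 3:
        raise ValueError(f"unexpected source code format: {code!r}")
    return source_id, language


# The six codes listed in the announcement.
codes = ["afp_arb", "asb_arb", "hyt_arb", "nhr_arb", "umh_arb", "xin_arb"]

for code in codes:
    source_id, language = split_source_code(code)
    print(f"{code}: source={source_id}, language={language}")
```

Rejecting anything that does not match the 3 + 1 + 3 layout is a design choice for the sketch; whether files in the corpus ever deviate from it is not stated in the announcement.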

The epochs and document counts for the data in the third edition are  
set forth below:

Newly Added Data

Source                  Date Span            Document Count
Agence France Presse    2005.01 - 2006.12    137815
Assabah News Agency     2004.09 - 2006.12    15410   (new source)
Al Hayat News Agency    2005.01 - 2006.1     8799    (no data for 2004)
An Nahar News Agency    2005.01 - 2006.12    104950  (no data for 2004)
Xinhua News Agency      2005.01 - 2006.12    135472
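
For readers who want a single figure for the newly added material, the counts listed above can be summed as in the sketch below. The dictionary name new_documents is chosen for this example only, and the total covers just the newly added data shown here, not the full corpus carried over from the second edition.

```python
# Document counts for the newly added data, copied from the figures above.
new_documents = {
    "Agence France Presse": 137815,
    "Assabah News Agency": 15410,
    "Al Hayat News Agency": 8799,
    "An Nahar News Agency": 104950,
    "Xinhua News Agency": 135472,
}

total = sum(new_documents.values())
print(f"Newly added documents: {total}")  # prints "Newly added documents: 402446"
```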

This release contains 547 files, totaling approximately 1.8 GB in
compressed form (6,673 MB uncompressed), and 1,994,735 K-words.

Linguistic Data Consortium        Phone: (215) 573-1275
University of Pennsylvania        Fax: (215) 573-2175
3600 Market St., Suite 810        ldc at ldc.upenn.edu
Philadelphia, PA 19104 USA        http://www.ldc.upenn.edu

--------------------------------------------------------------------------
End of Arabic-L:  30 Nov 2007


