Arabic-L:LING:Arabic Gigaword Third Edition
Dilworth Parkinson
dilworth_parkinson at BYU.EDU
Fri Nov 30 18:21:07 UTC 2007
------------------------------------------------------------------------
Arabic-L: Fri 30 Nov 2007
Moderator: Dilworth Parkinson <dilworth_parkinson at byu.edu>
[To post messages to the list, send them to arabic-l at byu.edu]
[To unsubscribe, send message from same address you subscribed from to
listserv at byu.edu with first line reading:
unsubscribe arabic-l ]
-------------------------Directory------------------------------------
1) Subject:Arabic Gigaword Third Edition
-------------------------Messages-----------------------------------
1)
Date: 30 Nov 2007
From:ldc at ldc.upenn.edu
Subject:Arabic Gigaword Third Edition
Arabic Gigaword Third Edition is a comprehensive archive of newswire
text data acquired from Arabic news sources by the LDC at the
University of Pennsylvania. Arabic Gigaword Third Edition includes all
of the content of Arabic Gigaword Second Edition (LDC2006T02) as well
as new data collected after the publication of that edition. Also, an
archive from a new newswire source -- Assabah -- has been included in
the third edition.
The six distinct sources of Arabic newswire represented in the third
edition are:
• Agence France Presse (afp_arb)
• Assabah (asb_arb)
• Al Hayat (hyt_arb)
• An Nahar (nhr_arb)
• Ummah Press (umh_arb)
• Xinhua News Agency (xin_arb)
The seven-character codes in the parentheses above consist of the
three-character source name IDs and the three-character language code
("arb") separated by an underscore ("_") character.
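As a minimal sketch of the code format just described, the following Python splits a seven-character code into its source ID and language code. The function name `split_source_code` and the validation rules are assumptions made for illustration; they are not part of the LDC release or its tooling.

```python
def split_source_code(code):
    """Split a code such as 'afp_arb' into (source_id, language).

    Hypothetical helper, not part of the corpus distribution. It assumes
    both parts are exactly three alphanumeric characters.
    """
    source_id, sep, language = code.partition("_")
    well_formed = (
        sep == "_"
        and len(source_id) == 3
        and len(language) == 3
        and source_id.isalnum()
        and language.isalnum()
    )
    if not well_formed:
        raise ValueError(f"unexpected source code: {code!r}")
    return source_id, language


# The six codes listed in the announcement.
CODES = ["afp_arb", "asb_arb", "hyt_arb", "nhr_arb", "umh_arb", "xin_arb"]
PARSED = [split_source_code(code) for code in CODES]
```

Every entry in `PARSED` carries the same language part, "arb", and a distinct three-character source ID.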
The epochs and document counts for the data in the third edition are
set forth below:
Newly Added Data

Source                  Date Span            Document Count
Agence France Presse    2005.01 - 2006.12    137815
Assabah News Agency     2004.09 - 2006.12    15410     (new source)
Al Hayat News Agency    2005.01 - 2006.1     8799      (no data for 2004)
An Nahar News Agency    2005.01 - 2006.12    104950    (no data for 2004)
Xinhua News Agency      2005.01 - 2006.12    135472
This release contains 547 files, totaling approximately 1.8GB in
compressed form (6,673 MB uncompressed) and 1,994,735 K-words.
Linguistic Data Consortium        Phone: (215) 573-1275
University of Pennsylvania        Fax: (215) 573-2175
3600 Market St., Suite 810        ldc at ldc.upenn.edu
Philadelphia, PA 19104 USA        http://www.ldc.upenn.edu
--------------------------------------------------------------------------
End of Arabic-L: 30 Nov 2007