Arabic-L:LING:Arabic Gigaword Third Edition
Dilworth Parkinson
dilworth_parkinson at BYU.EDU
Tue Nov 20 23:06:55 UTC 2007
------------------------------------------------------------------------
Arabic-L: Tue 20 Nov 2007
Moderator: Dilworth Parkinson <dilworth_parkinson at byu.edu>
[To post messages to the list, send them to arabic-l at byu.edu]
[To unsubscribe, send message from same address you subscribed from to
listserv at byu.edu with first line reading:
unsubscribe arabic-l ]
-------------------------Directory------------------------------------
1) Subject:Arabic Gigaword Third Edition
-------------------------Messages-----------------------------------
1)
Date: 20 Nov 2007
From:ldc at ldc.upenn.edu
Subject:Arabic Gigaword Third Edition
(1) Arabic Gigaword Third Edition is a comprehensive archive of
newswire text data acquired from Arabic news sources by the LDC at
the University of Pennsylvania. Arabic Gigaword Third Edition
includes all of the content of Arabic Gigaword Second Edition
(LDC2006T02) as well as new data collected after the publication of
that edition. Also, an archive from a new newswire source -- Assabah
-- has been included in the third edition.
The six distinct sources of Arabic newswire represented in the third
edition are:
Agence France Presse (afp_arb)
Assabah (asb_arb)
Al Hayat (hyt_arb)
An Nahar (nhr_arb)
Ummah Press (umh_arb)
Xinhua News Agency (xin_arb)
The seven-character codes in the parentheses above consist of the
three-character source name IDs and the three-character language code
("arb") separated by an underscore ("_") character.
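As a rough illustration of the code layout just described, the sketch below splits a code such as "afp_arb" into its source ID and language code. The function name parse_source_code and the SOURCES lookup table are hypothetical names chosen for this example; they are not part of the corpus or of any LDC tooling, and the sketch assumes the codes are always lowercase ASCII letters.

```python
import re

# Source IDs and names as listed in the announcement above.
SOURCES = {
    "afp": "Agence France Presse",
    "asb": "Assabah",
    "hyt": "Al Hayat",
    "nhr": "An Nahar",
    "umh": "Ummah Press",
    "xin": "Xinhua News Agency",
}

# Three letters, an underscore, three letters: seven characters in total.
# Assumption: codes are lowercase ASCII, as in the list above.
CODE_RE = re.compile(r"([a-z]{3})_([a-z]{3})")


def parse_source_code(code):
    """Split a code such as 'afp_arb' into (source_id, language_code)."""
    match = CODE_RE.fullmatch(code)
    if match is None:
        raise ValueError(f"not a seven-character source code: {code!r}")
    return match.group(1), match.group(2)


if __name__ == "__main__":
    source_id, language = parse_source_code("afp_arb")
    print(source_id, language, SOURCES.get(source_id, "unknown source"))
```

A lookup with SOURCES.get is used in the usage lines so that a well-formed code with an unlisted source ID is reported as unknown rather than raising an error.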
The epochs and document counts for the data in the third edition are
set forth below:
Newly Added Data
Source                  Date Span            Document Count
Agence France Presse    2005.01 - 2006.12    137815
Assabah News Agency     2004.09 - 2006.12    15410    (new source)
Al Hayat News Agency    2005.01 - 2006.1     8799     (no data for 2004)
An Nahar News Agency    2005.01 - 2006.12    104950   (no data for 2004)
Xinhua News Agency      2005.01 - 2006.12    135472
This release contains 547 files, totaling approximately 1.8GB in
compressed form (6,673 MB uncompressed) and 1,994,735 K-words.
Arabic Gigaword Third Edition is distributed on one DVD-ROM.
2007 Subscription Members will automatically receive two copies of
this corpus. 2007 Standard Members may request a copy as part of
their 16 free membership corpora. Nonmembers may license this data
for US$4000.
------------------------------------------------------------------------
--
End of Arabic-L: 20 Nov 2007