Arabic-L:LING:Arabic Gigaword 5th Edition

Dilworth Parkinson dil at BYU.EDU
Sat Nov 12 12:52:22 UTC 2011


------------------------------------------------------------------------
Arabic-L: Sat 12 Nov 2011
Moderator: Dilworth Parkinson <dilworth_parkinson at byu.edu>
[To post messages to the list, send them to arabic-l at byu.edu]
[To unsubscribe, send message from same address you subscribed from to
listserv at byu.edu with first line reading:
            unsubscribe arabic-l                                      ]

-------------------------Directory------------------------------------

1) Subject:Arabic Gigaword 5th Edition

-------------------------Messages-----------------------------------
1)
Date: 12 Nov 2011
From:Linguistic Data Consortium <ldc at ldc.upenn.edu> (reposted from CORPORA)
Subject:Arabic Gigaword 5th Edition


Arabic Gigaword Fifth Edition<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2011T11> is a comprehensive archive of newswire text data that LDC has acquired from Arabic news sources over several years. It includes all of the content of the fourth edition of Arabic Gigaword (LDC2009T30<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T30>) plus new data covering the period from January 1, 2009 through December 31, 2010.
Nine distinct sources of Arabic newswire are represented in this distribution:
Asharq Al-Awsat (aaw_arb)
Agence France Presse (afp_arb)
Al-Ahram (ahr_arb)
Assabah (asb_arb)
Al Hayat (hyt_arb)
An Nahar (nhr_arb)
Al-Quds Al-Arabi (qds_arb)
Ummah Press (umh_arb)
Xinhua News Agency (xin_arb)
The seven-character codes shown in parentheses above serve both as the names of the directories where the data files are found and as the prefix that appears at the beginning of every file name. Each code consists of a three-character source ID and the three-character language code ("arb"), separated by an underscore ("_") character. The language code conforms to the ISO 639-3<http://www.sil.org/iso639-3/default.asp> standard.
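As an illustration of the naming scheme just described, the following Python sketch splits a directory or file name into its source ID and language code and looks up the source name. It is not part of the LDC distribution: the function names parse_prefix and source_name are hypothetical, and the sketch assumes only what the announcement states, namely that every name begins with the seven-character code.

```python
import re

# Three-character source IDs and source names, copied from the list above.
SOURCE_NAMES = {
    "aaw": "Asharq Al-Awsat",
    "afp": "Agence France Presse",
    "ahr": "Al-Ahram",
    "asb": "Assabah",
    "hyt": "Al Hayat",
    "nhr": "An Nahar",
    "qds": "Al-Quds Al-Arabi",
    "umh": "Ummah Press",
    "xin": "Xinhua News Agency",
}

# Seven-character prefix: source ID, underscore, language code.
PREFIX_RE = re.compile(r"([a-z]{3})_([a-z]{3})")


def parse_prefix(name):
    """Return (source_id, language_code) from the start of a name.

    Hypothetical helper: anything after the first seven characters is
    ignored, because the announcement does not describe the rest of
    the file name.
    """
    match = PREFIX_RE.match(name)
    if match is None:
        raise ValueError(f"no seven-character prefix in {name!r}")
    return match.group(1), match.group(2)


def source_name(name):
    """Look up the source name; raises KeyError for an unlisted source ID."""
    source_id, _language = parse_prefix(name)
    return SOURCE_NAMES[source_id]
```

For example, parse_prefix("afp_arb") gives ("afp", "arb"), and source_name applied to any name beginning with "xin_arb" gives "Xinhua News Agency".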
In addition to adding new data, the following updates were made:
Repeated documents in Asharq Al-Awsat data from 2008 were removed.
Document formatting and docid duplication problems were corrected in Agence France Presse data.
Significant duplication of content in 2007-2008 An Nahar data was detected, and the duplicated documents were removed.

--------------------------------------------------------------------------
End of Arabic-L:  12 Nov 2011
