Arabic-L:LING:LDC Arabic English Newswire Translation Collection
Dilworth Parkinson
dil at BYU.EDU
Wed Aug 26 23:14:02 UTC 2009
------------------------------------------------------------------------
Arabic-L: Wed 26 Aug 2009
Moderator: Dilworth Parkinson <dilworth_parkinson at byu.edu>
[To post messages to the list, send them to arabic-l at byu.edu]
[To unsubscribe, send message from same address you subscribed from to
listserv at byu.edu with first line reading:
unsubscribe arabic-l ]
-------------------------Directory------------------------------------
1) Subject: LDC Arabic English Newswire Translation Collection
-------------------------Messages-----------------------------------
1)
Date: 26 Aug 2009
From: ldc at ldc.upenn.edu
Subject: LDC Arabic English Newswire Translation Collection
The Arabic English Newswire Translation Collection consists of
approximately 550,000 words of Arabic newswire text and its English
translation from Agence France Presse (France), An Nahar (Lebanon) and
Assabah (Tunisia). The source Arabic text was used in LDC's Arabic
Treebank, specifically in Part 1 (Part 1 v. 2.0; Part 1 v. 3.0), Part
3 (Part 3 v. 1.0; Part 3 v. 2.0) and Part 4 (Part 4 v. 1.0). A subset
of Agence France Presse (AFP) source text from Arabic Treebank: Part 1
v. 2.0 was previously translated and released by LDC in Arabic
Treebank: Part 1 - 10K-word English Translation, LDC2003T07. The
English translations in this corpus were provided by translation
agencies using LDC's Arabic Translation Guidelines.
The number of stories and their epochs for each source are as follows:
Source      Stories   Epoch
AFP             734   July 2000 - November 2000
An Nahar        600   January 2002 - December 2002
Assabah         397   September 2004 - November 2004
Total          1731
Word count of Arabic tokens by source is shown in the following table:
Source      Arabic tokens
AFP               102,564
An Nahar          299,681
Assabah           149,259
Total             551,504
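As a quick sanity check on the figures above, the per-source counts can be summed and compared with the stated totals. This is only an illustrative sketch: the dictionary names are made up for the example, and the percentage column is derived here, not part of the announcement.

```python
# Figures copied from the announcement; variable names are illustrative only.
stories = {"AFP": 734, "An Nahar": 600, "Assabah": 397}
arabic_words = {"AFP": 102_564, "An Nahar": 299_681, "Assabah": 149_259}

total_stories = sum(stories.values())
total_words = sum(arabic_words.values())

# Print each source's token count and its share of the whole collection.
for source, count in arabic_words.items():
    share = 100 * count / total_words
    print(f"{source:<10}{count:>9,}{share:>7.1f}%")
print(f"{'Total':<10}{total_words:>9,}")
```

Both sums agree with the totals given in the announcement (1731 stories and 551,504 Arabic tokens).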
The original source files used different encodings for the Arabic
characters, including UTF-8 and ASMO. SGML tags were used for marking
sentence and paragraph boundaries and for annotating other information
about each story. All Arabic source data was converted to UTF-8 and most
SGML tags were removed or replaced by "plain text" markers.
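The kind of normalization described above might look roughly like the following. This is a minimal sketch and not LDC's actual procedure. It assumes that "ASMO" refers to ASMO 708 (the same code table as ISO 8859-6, which Python exposes as "iso8859_6") and that the target encoding is UTF-8. The tag pattern and function names are hypothetical, since the announcement does not list the SGML tags that were used.

```python
import re

# Assumption: "ASMO" means ASMO 708, which shares its code table with
# ISO 8859-6; Python's codec name for that table is "iso8859_6".
ASSUMED_LEGACY_CODEC = "iso8859_6"

# Hypothetical pattern: matches any SGML-style tag such as <DOC> or </P>.
SGML_TAG = re.compile(r"<[^>]+>")


def to_utf8(raw: bytes) -> bytes:
    """Re-encode raw bytes as UTF-8, trying UTF-8 first, then the legacy codec."""
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # Heuristic fallback: bytes that are not valid UTF-8 are treated
        # as text in the assumed legacy encoding.
        text = raw.decode(ASSUMED_LEGACY_CODEC)
    return text.encode("utf-8")


def strip_tags(text: str) -> str:
    """Remove SGML-style tags and drop lines left empty by the removal."""
    lines = (SGML_TAG.sub("", line).strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)
```

Trying UTF-8 first is only a heuristic. It works for typical Arabic text because ISO 8859-6 byte sequences are rarely valid UTF-8, but a real pipeline would more likely rely on per-source knowledge of the encoding.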
--------------------------------------------------------------------------
End of Arabic-L: 26 Aug 2009