Arabic-L:LING:LDC Arabic English Newswire Translation Collection

Dilworth Parkinson dil at BYU.EDU
Wed Aug 26 23:14:02 UTC 2009


------------------------------------------------------------------------
Arabic-L: Wed 26 Aug 2009
Moderator: Dilworth Parkinson <dilworth_parkinson at byu.edu>
[To post messages to the list, send them to arabic-l at byu.edu]
[To unsubscribe, send message from same address you subscribed from to
listserv at byu.edu with first line reading:
             unsubscribe arabic-l                                      ]

-------------------------Directory------------------------------------

1) Subject:LDC Arabic English Newswire Translation Collection

-------------------------Messages-----------------------------------
1)
Date: 26 Aug 2009
From:ldc at ldc.upenn.edu
Subject:LDC Arabic English Newswire Translation Collection

The Arabic English Newswire Translation Collection consists of  
approximately 550,000 words of Arabic newswire text and its English  
translation from Agence France Presse (France), An Nahar (Lebanon) and  
Assabah (Tunisia). The source Arabic text was used in LDC's Arabic  
Treebank, specifically, in Part 1 (Part 1 v. 2.0; Part 1 v. 3.0), Part
3 (Part 3 v. 1.0; Part 3 v. 2.0) and Part 4 (Part 4 v. 1.0). A subset  
of Agence France Presse (AFP) source text from Arabic Treebank: Part 1  
v. 2.0 was previously translated and released by LDC in Arabic  
Treebank: Part 1 - 10K-word English Translation, LDC2003T07. The  
English translations in this corpus were provided by translation  
agencies using LDC's Arabic Translation Guidelines.

The number of stories and their epochs for each source are as follows:

Source      Stories   Epoch
AFP             734   July 2000 - November 2000
An Nahar        600   January 2002 - December 2002
Assabah         397   September 2004 - November 2004
Total          1731

Word count of Arabic tokens by source is shown in the following table:

Source      Arabic words
AFP              102,564
An Nahar         299,681
Assabah          149,259
Total            551,504
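
The totals for stories and Arabic words can be checked against the per-source figures with a few lines of Python. This is only a sanity-check sketch; the dictionary and variable names are illustrative and are not part of the corpus distribution.

```python
# Per-source figures copied from the announcement above.
stories = {"AFP": 734, "An Nahar": 600, "Assabah": 397}
arabic_words = {"AFP": 102_564, "An Nahar": 299_681, "Assabah": 149_259}

# Sum each column and compare with the stated totals.
total_stories = sum(stories.values())
total_words = sum(arabic_words.values())

print(total_stories)  # 1731
print(total_words)    # 551504
```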

The original source files used different encodings for the Arabic
characters, including UTF-8 and ASMO. SGML tags were used for marking
sentence and paragraph boundaries and for annotating other information
about each story. All Arabic source data was converted to UTF-8 and most
SGML tags were removed or replaced by "plain text" markers.
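
As a rough illustration of this kind of normalization, the following Python sketch decodes bytes from a legacy Arabic encoding, strips SGML-style tags, and re-encodes the result as UTF-8. It is not LDC's actual conversion procedure. The function name `normalize_story`, the use of Python's `iso8859_6` codec as a stand-in for ASMO (assuming ASMO 708, which shares its layout with ISO 8859-6), and the `<P>` tag in the sample are all assumptions made for the example. The sketch also drops every tag, whereas the corpus replaced some of them with plain-text markers.

```python
import re

# Matches any SGML-style tag such as <P> or </DOC>; a deliberately crude pattern.
TAG_RE = re.compile(r"<[^>]+>")

def normalize_story(raw: bytes, source_encoding: str = "iso8859_6") -> bytes:
    """Decode legacy-encoded bytes, drop tags, and return UTF-8 bytes."""
    text = raw.decode(source_encoding)   # legacy bytes -> Unicode string
    text = TAG_RE.sub("", text)          # remove all tags (assumption: none kept)
    return text.encode("utf-8")          # Unicode string -> UTF-8 bytes

# Hypothetical sample: one tagged paragraph holding the Arabic word "salam",
# written with escapes and encoded in the assumed legacy encoding.
word = "\u0633\u0644\u0627\u0645"
sample = f"<P>{word}</P>".encode("iso8859_6")

converted = normalize_story(sample)
print(converted.decode("utf-8") == word)  # True
```

Files that are already in UTF-8 can go through the same function by passing `source_encoding="utf-8"`, which makes the decode and encode steps a round trip and leaves only the tag removal.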



--------------------------------------------------------------------------
End of Arabic-L:  26 Aug 2009

