<html><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space; "><pre id="nonprop"><p align=""><font class="Apple-style-span" face="Helvetica" size="3"><span class="Apple-style-span" style="font-size: 12px; ">------------------------------------------------------------------------
Arabic-L: Wed 26 Aug 2009
Moderator: Dilworth Parkinson &lt;<a href="mailto:dilworth_parkinson@byu.edu">dilworth_parkinson@byu.edu</a>&gt;
[To post messages to the list, send them to <a href="mailto:arabic-l@byu.edu">arabic-l@byu.edu</a>]
[To unsubscribe, send message from same address you subscribed from to
<a href="mailto:listserv@byu.edu">listserv@byu.edu</a> with first line reading:
unsubscribe arabic-l ]
-------------------------Directory------------------------------------
1) Subject: LDC Arabic English Newswire Translation Collection
-------------------------Messages-----------------------------------
1)
Date: 26 Aug 2009
From: <a href="mailto:ldc@ldc.upenn.edu">ldc@ldc.upenn.edu</a>
Subject: LDC Arabic English Newswire Translation Collection
<span class="Apple-style-span" style="font-size: medium; white-space: normal; "><p><span style="color: black; ">The</span> <span style="color: black; "><a href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009T22">Arabic English Newswire Translation Collection</a></span> consists of approximately 550,000 words of Arabic newswire text and its English translation from Agence France Presse (France), An Nahar (Lebanon) and Assabah (Tunisia). The source Arabic text was used in LDC's Arabic Treebank, specifically in Part 1 (<a href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2003T06">Part 1 v. 2.0</a>; <a href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T02">Part 1 v. 3.0</a>), Part 3 (<a href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004T11">Part 3 v. 1.0</a>; <a href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T20">Part 3 v. 2.0</a>) and Part 4 (<a href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T30">Part 4 v. 1.0</a>). A subset of the Agence France Presse (AFP) source text from Arabic Treebank: Part 1 v. 2.0 was previously translated and released by LDC in <a href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2003T07">Arabic Treebank: Part 1 - 10K-word English Translation, LDC2003T07</a>. 
The English translations in this corpus were provided by translation agencies using LDC's Arabic Translation Guidelines.</p><p>The number of stories and their epochs for each source are as follows:</p>
<table class="MsoNormalTable" border="0" cellpadding="0"><tbody><tr><td style="padding-top: 0.75pt; padding-right: 0.75pt; padding-bottom: 0.75pt; padding-left: 0.75pt; "><p class="MsoNormal">AFP</p></td><td style="padding-top: 0.75pt; padding-right: 0.75pt; padding-bottom: 0.75pt; padding-left: 0.75pt; "><p class="MsoNormal">734 stories; July 2000 - November 2000</p></td></tr><tr><td style="padding-top: 0.75pt; padding-right: 0.75pt; padding-bottom: 0.75pt; padding-left: 0.75pt; "><p class="MsoNormal">An Nahar</p></td><td style="padding-top: 0.75pt; padding-right: 0.75pt; padding-bottom: 0.75pt; padding-left: 0.75pt; "><p class="MsoNormal">600 stories; January 2002 - December 2002</p></td></tr><tr><td style="padding-top: 0.75pt; padding-right: 0.75pt; padding-bottom: 0.75pt; padding-left: 0.75pt; "><p class="MsoNormal">Assabah</p></td><td style="padding-top: 0.75pt; padding-right: 0.75pt; padding-bottom: 0.75pt; padding-left: 0.75pt; "><p class="MsoNormal">397 stories; September 2004 - November 2004</p></td></tr><tr><td style="padding-top: 0.75pt; padding-right: 0.75pt; padding-bottom: 0.75pt; padding-left: 0.75pt; "><p class="MsoNormal">Total</p></td><td style="padding-top: 0.75pt; padding-right: 0.75pt; padding-bottom: 0.75pt; padding-left: 0.75pt; "><p class="MsoNormal">1,731 stories</p></td></tr></tbody></table>
<p>The number of Arabic tokens from each source is shown in the following table:</p>
<table class="MsoNormalTable" border="0" cellpadding="0"><tbody><tr><td style="padding-top: 0.75pt; padding-right: 0.75pt; padding-bottom: 0.75pt; padding-left: 0.75pt; "><p class="MsoNormal">AFP</p></td><td style="padding-top: 0.75pt; padding-right: 0.75pt; padding-bottom: 0.75pt; padding-left: 0.75pt; "><p class="MsoNormal">102,564</p></td></tr><tr><td style="padding-top: 0.75pt; padding-right: 0.75pt; padding-bottom: 0.75pt; padding-left: 0.75pt; "><p class="MsoNormal">An Nahar</p></td><td style="padding-top: 0.75pt; padding-right: 0.75pt; padding-bottom: 0.75pt; padding-left: 0.75pt; "><p class="MsoNormal">299,681</p></td></tr><tr><td style="padding-top: 0.75pt; padding-right: 0.75pt; padding-bottom: 0.75pt; padding-left: 0.75pt; "><p class="MsoNormal">Assabah</p></td><td style="padding-top: 0.75pt; padding-right: 0.75pt; padding-bottom: 0.75pt; padding-left: 0.75pt; "><p class="MsoNormal">149,259</p></td></tr><tr><td colspan="2" style="padding-top: 0.75pt; padding-right: 0.75pt; padding-bottom: 0.75pt; padding-left: 0.75pt; "><div class="MsoNormal" align="center" style="text-align: center; "><hr align="center" size="2" width="100%"></div></td></tr><tr><td style="padding-top: 0.75pt; padding-right: 0.75pt; padding-bottom: 0.75pt; padding-left: 0.75pt; "><p class="MsoNormal">Total</p></td><td style="padding-top: 0.75pt; padding-right: 0.75pt; padding-bottom: 0.75pt; padding-left: 0.75pt; "><p class="MsoNormal">551,504</p></td></tr></tbody></table>
<p>The original source files used different encodings for the Arabic characters, including UTF-8 and ASMO. SGML tags were used for marking sentence and paragraph boundaries and for annotating other information about each story. All Arabic source data was converted to UTF-8, and most SGML tags were removed or replaced by "plain text" markers.</p><div><br></div></span>
--------------------------------------------------------------------------
End of Arabic-L: 26 Aug 2009
</span></font></p><div><font class="Apple-style-span" face="Helvetica" size="3"><span class="Apple-style-span" style="font-size: 12px;"><br></span></font></div></pre></body></html>