Arabic-L:LING:Arabic-English Parallel Text Corpus

Dilworth Parkinson dilworth_parkinson at BYU.EDU
Thu Feb 22 18:46:20 UTC 2007


------------------------------------------------------------------------
Arabic-L: Thu 22 Feb 2007
Moderator: Dilworth Parkinson <dilworth_parkinson at byu.edu>
[To post messages to the list, send them to arabic-l at byu.edu]
[To unsubscribe, send message from same address you subscribed from to
listserv at byu.edu with first line reading:
            unsubscribe arabic-l                                      ]

-------------------------Directory------------------------------------

1) Subject:Arabic-English Parallel Text Corpus

-------------------------Messages-----------------------------------
1)
Date: 22 Feb 2007
From:reposted from LDC
Subject:Arabic-English Parallel Text Corpus

ISI Arabic-English Automatically Extracted Parallel Text consists of  
Arabic-English parallel sentences which were extracted automatically  
from two monolingual corpora: Arabic Gigaword Second Edition  
(LDC2006T02) and English Gigaword Second Edition (LDC2005T12). The  
data was extracted from news articles published by Xinhua News Agency  
and Agence France Presse.  The corpus contains 1,124,609 sentence  
pairs; the word count on the English side is approximately 31M words.  
The sentences in the parallel corpus preserve the form and encoding  
of the texts in the original Gigaword corpora.

For each sentence pair in the corpus we provide the names of the  
documents from which the two sentences were extracted, as well as a  
confidence score (between 0.5 and 1.0), which is indicative of their  
degree of parallelism. The parallel sentence identification approach  
is designed to judge sentence pairs in isolation from their contexts,  
and can therefore find parallel sentences within document pairs which  
are not parallel.
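
As an illustration of how the confidence score might be used, here is a
minimal Python sketch that keeps only the sentence pairs at or above a
chosen threshold. The announcement does not describe the on-disk format
of the corpus, so the layout assumed here (one pair per line, five
tab-separated fields: Arabic document name, English document name,
score, Arabic sentence, English sentence) is purely hypothetical, as are
the names SentencePair and filter_pairs. Check the documentation shipped
with the release before adapting it.

```python
from typing import Iterable, Iterator, NamedTuple


class SentencePair(NamedTuple):
    """One aligned pair. The field layout is an assumption, not the LDC format."""
    arabic_doc: str
    english_doc: str
    score: float
    arabic: str
    english: str


def filter_pairs(lines: Iterable[str], min_score: float = 0.5) -> Iterator[SentencePair]:
    """Yield the pairs whose confidence score is at least min_score.

    Assumes each line holds five tab-separated fields (hypothetical layout).
    Lines that do not have exactly five fields are skipped.
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 5:
            continue
        arabic_doc, english_doc, raw_score, arabic, english = fields
        score = float(raw_score)
        # The announcement states that scores lie between 0.5 and 1.0.
        if not 0.5 <= score <= 1.0:
            raise ValueError(f"score outside the documented range: {score}")
        if score >= min_score:
            yield SentencePair(arabic_doc, english_doc, score, arabic, english)


# Made-up sample lines in the assumed layout (not real corpus data).
sample = [
    "AR_DOC_1\tEN_DOC_1\t0.93\tarabic sentence one\tenglish sentence one\n",
    "AR_DOC_2\tEN_DOC_2\t0.55\tarabic sentence two\tenglish sentence two\n",
]
high_confidence = list(filter_pairs(sample, min_score=0.9))
```

A higher threshold trades corpus size for cleaner pairs; which cutoff
suits a given MT experiment is an empirical question that the
announcement does not address.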

In order to make this resource useful for research in Machine  
Translation (MT), we made efforts to detect potential overlaps  
between this data and the standard test and development data sets  
used by the MT community.  ISI Arabic-English Automatically Extracted  
Parallel Text is distributed via web download.

2007 Subscription Members will automatically receive two copies of  
this corpus on disc. 2007 Standard Members may request a copy as part  
of their 16 free membership corpora. Nonmembers may license this data  
for US$4000.


--------------------------------------------------------------------------
End of Arabic-L:  22 Feb 2007