Arabic-L:LING:LDC resources

Fri Mar 16 15:36:40 UTC 2007

------------------------------------------------------------------------
Arabic-L: Fri 16 Mar 2007
Moderator: Dilworth Parkinson <dilworth_parkinson at byu.edu>
[To post messages to the list, send them to arabic-l at byu.edu]
[To unsubscribe, send message from same address you subscribed from to
listserv at byu.edu with first line reading:
            unsubscribe arabic-l                                      ]

-------------------------Directory------------------------------------

1) Subject:LDC resources

-------------------------Messages-----------------------------------
1)
Date: 16 Mar 2007
From:ldc at ldc.upenn.edu
Subject:LDC resources

The Linguistic Data Consortium (LDC) would like to announce the  
availability of two new publications and provide information  
regarding forthcoming publications.

..........

(2)  ISI Arabic-English Automatically Extracted Parallel Text  
consists of Arabic-English parallel sentences which were extracted  
automatically from two monolingual corpora: Arabic Gigaword Second  
Edition (LDC2006T02) and English Gigaword Second Edition  
(LDC2005T12). The data was extracted from news articles published by  
Xinhua News Agency and Agence France Presse.  The corpus contains  
1,124,609 sentence pairs; the word count on the English side is  
approximately 31M words. The sentences in the parallel corpus  
preserve the form and encoding of the texts in the original Gigaword  
corpora.

For each sentence pair in the corpus we provide the names of the
documents from which the two sentences were extracted, as well as a
confidence score (between 0.5 and 1.0) that indicates their
degree of parallelism. The parallel sentence identification approach
judges sentence pairs in isolation from their contexts, and can
therefore find parallel sentences even within document pairs that
are not themselves parallel.
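
As an illustration of how the confidence score might be used, the
sketch below keeps only the sentence pairs at or above a chosen
threshold. The announcement does not describe the corpus file layout,
so the SentencePair record, its field names, and the
filter_by_confidence helper are hypothetical; only the 0.5 to 1.0
score range comes from the description above.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class SentencePair:
    # Hypothetical record layout; the real corpus files may differ.
    arabic: str
    english: str
    arabic_doc: str      # name of the source Arabic Gigaword document
    english_doc: str     # name of the source English Gigaword document
    confidence: float    # degree of parallelism, between 0.5 and 1.0


def filter_by_confidence(pairs: Iterable[SentencePair],
                         threshold: float = 0.9) -> List[SentencePair]:
    """Return the pairs whose confidence is at least `threshold`."""
    if not 0.5 <= threshold <= 1.0:
        raise ValueError("threshold must lie in the range 0.5 to 1.0")
    return [p for p in pairs if p.confidence >= threshold]
```

A higher threshold trades corpus size for cleaner pairs; the default of
0.9 here is an arbitrary choice for the example, not a recommendation
from the corpus authors.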

In order to make this resource useful for research in Machine  
Translation (MT), we made efforts to detect potential overlaps  
between this data and the standard test and development data sets  
used by the MT community.
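
The announcement does not say how overlaps were detected. Purely as an
illustration, a user who wants to run a similar check against their
own held-out sets could start from an exact-match comparison after
light normalization. The function names below are hypothetical, and
this is not the procedure used to build the corpus.

```python
import re
from typing import Iterable, List


def normalize(sentence: str) -> str:
    """Lowercase and collapse runs of whitespace (an assumed normalization)."""
    return re.sub(r"\s+", " ", sentence.strip().lower())


def find_overlaps(corpus_sentences: Iterable[str],
                  test_sentences: Iterable[str]) -> List[int]:
    """Return indices of corpus sentences that also occur in the test set."""
    held_out = {normalize(s) for s in test_sentences}
    return [i for i, s in enumerate(corpus_sentences)
            if normalize(s) in held_out]
```

Exact matching only catches verbatim duplicates; near-duplicates would
need a fuzzier comparison.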

--------------------------------------------------------------------------
End of Arabic-L:  16 Mar 2007