Arabic-L:LING:LDC resources
Dilworth Parkinson
dil at BYU.EDU
Fri Mar 16 15:36:40 UTC 2007
------------------------------------------------------------------------
Arabic-L: Fri 16 Mar 2007
Moderator: Dilworth Parkinson <dilworth_parkinson at byu.edu>
[To post messages to the list, send them to arabic-l at byu.edu]
[To unsubscribe, send message from same address you subscribed from to
listserv at byu.edu with first line reading:
unsubscribe arabic-l ]
-------------------------Directory------------------------------------
1) Subject:LDC resources
-------------------------Messages-----------------------------------
1)
Date: 16 Mar 2007
From:ldc at ldc.upenn.edu
Subject:LDC resources
The Linguistic Data Consortium (LDC) would like to announce the
availability of two new publications and provide information
regarding forthcoming publications.
..........
(2) ISI Arabic-English Automatically Extracted Parallel Text
consists of Arabic-English parallel sentences which were extracted
automatically from two monolingual corpora: Arabic Gigaword Second
Edition (LDC2006T02) and English Gigaword Second Edition
(LDC2005T12). The data was extracted from news articles published by
Xinhua News Agency and Agence France Presse. The corpus contains
1,124,609 sentence pairs; the word count on the English side is
approximately 31M words. The sentences in the parallel corpus
preserve the form and encoding of the texts in the original Gigaword
corpora.
For each sentence pair in the corpus we provide the names of the
documents from which the two sentences were extracted, as well as a
confidence score (between 0.5 and 1.0) that indicates the pair's
degree of parallelism. The parallel sentence identification approach
judges sentence pairs in isolation from their contexts, and can
therefore find parallel sentences even within document pairs that
are not themselves parallel.
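As a rough illustration of how the confidence score might be used, the
sketch below keeps only the sentence pairs at or above a chosen threshold.
The announcement does not describe the file layout, so everything about the
format here is an assumption: one pair per line, five tab-separated fields
(Arabic document name, English document name, score, Arabic sentence,
English sentence). The names SentencePair, parse_pairs and
filter_by_confidence are hypothetical and are not part of the LDC release.

```python
from typing import Iterable, Iterator, List, NamedTuple


class SentencePair(NamedTuple):
    """One extracted pair. The field layout is assumed, not documented."""
    arabic_doc: str
    english_doc: str
    score: float
    arabic: str
    english: str


def parse_pairs(lines: Iterable[str]) -> Iterator[SentencePair]:
    """Parse lines in the assumed five-field, tab-separated layout."""
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\n")
        if not line:
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) != 5:
            raise ValueError(f"line {number}: expected 5 fields, got {len(fields)}")
        ar_doc, en_doc, score, arabic, english = fields
        yield SentencePair(ar_doc, en_doc, float(score), arabic, english)


def filter_by_confidence(pairs: Iterable[SentencePair],
                         threshold: float = 0.9) -> List[SentencePair]:
    """Keep pairs whose score is at least `threshold`.

    Scores are described as lying between 0.5 and 1.0, so thresholds
    outside that range are rejected as probable mistakes.
    """
    if not 0.5 <= threshold <= 1.0:
        raise ValueError("threshold should lie between 0.5 and 1.0")
    return [pair for pair in pairs if pair.score >= threshold]
```

A higher threshold trades corpus size for cleaner pairs. The default of 0.9
is arbitrary and would need tuning against the task at hand.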
In order to make this resource useful for research in Machine
Translation (MT), we made efforts to detect potential overlaps
between this data and the standard test and development data sets
used by the MT community.
------------------------------------------------------------------------
--
End of Arabic-L: 16 Mar 2007