Arabic-L:LING:Arabic-English Parallel Text Corpus
Dilworth Parkinson
dilworth_parkinson at BYU.EDU
Thu Feb 22 18:46:20 UTC 2007
------------------------------------------------------------------------
Arabic-L: Thu 22 Feb 2007
Moderator: Dilworth Parkinson <dilworth_parkinson at byu.edu>
[To post messages to the list, send them to arabic-l at byu.edu]
[To unsubscribe, send message from same address you subscribed from to
listserv at byu.edu with first line reading:
unsubscribe arabic-l ]
-------------------------Directory------------------------------------
1) Subject:Arabic-English Parallel Text Corpus
-------------------------Messages-----------------------------------
1)
Date: 22 Feb 2007
From:reposted from LDC
Subject:Arabic-English Parallel Text Corpus
ISI Arabic-English Automatically Extracted Parallel Text consists of
Arabic-English parallel sentences which were extracted automatically
from two monolingual corpora: Arabic Gigaword Second Edition
(LDC2006T02) and English Gigaword Second Edition (LDC2005T12). The
data was extracted from news articles published by Xinhua News Agency
and Agence France Presse. The corpus contains 1,124,609 sentence
pairs; the word count on the English side is approximately 31M words.
The sentences in the parallel corpus preserve the form and encoding
of the texts in the original Gigaword corpora.
For each sentence pair in the corpus we provide the names of the
documents from which the two sentences were extracted, as well as a
confidence score (between 0.5 and 1.0), which is indicative of their
degree of parallelism. The parallel sentence identification approach
is designed to judge sentence pairs in isolation from their contexts,
and can therefore find parallel sentences within document pairs which
are not parallel.
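As a rough illustration of how the per-pair confidence score might be used, the sketch below keeps only sentence pairs at or above a chosen threshold. The announcement does not describe the file layout, so the five-field tab-separated format, the field order, and the names SentencePair, parse_line and filter_pairs are all assumptions made for this example, not the actual format of the release.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class SentencePair:
    arabic_doc: str   # name of the Arabic source document
    english_doc: str  # name of the English source document
    score: float      # confidence score (0.5 to 1.0 per the announcement)
    arabic: str       # Arabic sentence
    english: str      # English sentence


def parse_line(line: str) -> SentencePair:
    # ASSUMPTION: one pair per line, five tab-separated fields in this order.
    # The real distribution may be laid out differently.
    arabic_doc, english_doc, score, arabic, english = line.rstrip("\n").split("\t")
    return SentencePair(arabic_doc, english_doc, float(score), arabic, english)


def filter_pairs(pairs: Iterable[SentencePair], min_score: float = 0.9) -> List[SentencePair]:
    # Keep only the pairs whose confidence score meets the threshold.
    return [p for p in pairs if p.score >= min_score]
```

A higher threshold trades corpus size for cleaner pairs; the default of 0.9 here is arbitrary and would need tuning against the actual data.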
In order to make this resource useful for research in Machine
Translation (MT), we made efforts to detect potential overlaps
between this data and the standard test and development data sets
used by the MT community. ISI Arabic-English Automatically Extracted
Parallel Text is distributed via web download.
2007 Subscription Members will automatically receive two copies of
this corpus on disc. 2007 Standard Members may request a copy as part
of their 16 free membership corpora. Nonmembers may license this data
for US$4000.
------------------------------------------------------------------------
--
End of Arabic-L: 22 Feb 2007