Arabic-L:LING:LDC Arabic English Translation Corpus

Dilworth Parkinson dilworthparkinson at GMAIL.COM
Mon Mar 19 16:10:46 UTC 2012


------------------------------------------------------------------------
Arabic-L: Mon 19 Mar 2012
Moderator: Dilworth Parkinson <dilworth_parkinson at byu.edu>
[To post messages to the list, send them to arabic-l at byu.edu]
[To unsubscribe, send message from same address you subscribed from to
listserv at byu.edu with first line reading:
           unsubscribe arabic-l                                      ]

-------------------------Directory------------------------------------

1) Subject:LDC Arabic English Translation Corpus

-------------------------Messages-----------------------------------
1)
Date: 19 Mar 2012
From:reposted from LDC
Subject:LDC Arabic English Translation Corpus

(1) English Translation Treebank: An Nahar Newswire was developed by
LDC and consists of 599 distinct newswire stories from the Lebanese
publication An Nahar translated from Arabic to English and annotated
for part-of-speech and syntactic structure.
This corpus is part of an ongoing effort at LDC to produce parallel
Arabic and English treebanks. Both part-of-speech and syntactic
annotation follow the Penn Treebank II guidelines, with three sets of
changes: changes in the tokenization of hyphenated words;
part-of-speech and tree changes necessitated by those tokenization
changes; and revisions to the syntactic annotation to comply with the
updated annotation guidelines (including the "Treebank-PropBank merge"
or "Treebank IIa" and "treebank c" changes). The original Penn Treebank
II guidelines, the addenda describing changes to the guidelines, and
the tokenization specifications can be found on LDC's website.
The data consists of 461,489 tokens in 599 individual files. The news
stories in this release were published in An Nahar in 2002.
The English source files (translated from the Arabic) were
automatically tokenized, part-of-speech tagged and parsed; the tokens,
tags and parses were then manually corrected. The quality control
process consisted of a series of specific searches for over 100 types
of potential inconsistency and parse or annotation error. Any errors
found in those searches were manually corrected.
Annotations are in the following two formats:

Penn Style Trees:
   Bracketed tree files following the basic form (NODE (TAG token)).
   Each sentence is surrounded by a pair of empty parentheses.

AG xml:
   TreeEditor .xml stand-off annotation files. These files contain the
   POS and Treebank annotation and reference the source files by
   character offset. DTD files for the AG xml files were moved from
   their original location indicated in the readme to be more
   consistent with LDC publications.
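
As a rough illustration of the Penn-style format described above, the
following Python sketch reads bracketed trees and lists their
(tag, token) pairs. It is not part of the LDC release: the function
names parse_trees and tagged_tokens and the sample sentence are made up
for illustration, and the sketch assumes that literal parentheses
inside tokens are escaped (e.g. as -LRB-/-RRB-), as is usual in Penn
Treebank data.

```python
import re

# Splits bracketed text into "(", ")" and symbols (labels, tags, tokens).
# Assumption: raw parentheses never occur inside a token.
TOKEN_RE = re.compile(r"\(|\)|[^\s()]+")


def parse_trees(text):
    """Parse bracketed text into nested lists of the form [label, child, ...].

    Leaves are plain strings. The empty pair of parentheses around each
    sentence is removed, so the result holds one tree per sentence.
    (Hypothetical helper, not an LDC tool.)
    """
    tokens = TOKEN_RE.findall(text)
    pos = 0

    def parse_node():
        nonlocal pos
        pos += 1  # consume "("
        label = ""
        # The outer wrapper has no label: its first item is another "(".
        if tokens[pos] not in ("(", ")"):
            label = tokens[pos]
            pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(parse_node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1  # consume ")"
        return [label] + children

    trees = []
    while pos < len(tokens):
        if tokens[pos] != "(":
            raise ValueError("expected '(' at token %d" % pos)
        node = parse_node()
        if node[0] == "":
            trees.extend(node[1:])  # drop the empty outer wrapper
        else:
            trees.append(node)
    return trees


def tagged_tokens(tree):
    """Yield (tag, token) pairs from the (TAG token) leaves, left to right."""
    label, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        yield (label, children[0])
    else:
        for child in children:
            if isinstance(child, list):
                yield from tagged_tokens(child)


# Invented sample sentence, not taken from the corpus.
sample = "( (S (NP (DT The) (NN story)) (VP (VBD ran)) (. .)) )"
trees = parse_trees(sample)
pairs = list(tagged_tokens(trees[0]))
```

The nested-list representation is only one possible choice; an
existing treebank reader could be used instead, and unbalanced input is
not handled gracefully by this sketch.
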
English Translation Treebank: An Nahar Newswire is distributed via web
download.

2012 Subscription Members will automatically receive two copies of
this corpus on disc. 2012 Standard Members may request a copy as part
of their 16 free membership corpora. Non-members may license this
data for US$4500.

--------------------------------------------------------------------------
End of Arabic-L:  19 Mar 2012


