Arabic-L:LING:LDC Newsbook Parallel Text
Dilworth Parkinson
dil at BYU.EDU
Mon Jun 1 17:33:10 UTC 2009
------------------------------------------------------------------------
Arabic-L: Mon 01 Jun 2009
Moderator: Dilworth Parkinson <dilworth_parkinson at byu.edu>
[To post messages to the list, send them to arabic-l at byu.edu]
[To unsubscribe, send message from same address you subscribed from to
listserv at byu.edu with first line reading:
unsubscribe arabic-l ]
-------------------------Directory------------------------------------
1) Subject:LDC Newsbook Parallel Text
-------------------------Messages-----------------------------------
1)
Date: 01 Jun 2009
From:ldc at ldc.upenn.edu
Subject:LDC Newsbook Parallel Text
GALE Phase 1 Arabic Newsgroup Parallel Text - Part 2 contains a
total of 145,000 words (263 files) of Arabic newsgroup text and its
translation selected from thirty-five sources. Newsgroups consist of
posts to electronic bulletin boards, Usenet newsgroups, discussion
groups and similar forums. This release was used as training data in
Phase 1 (year 1) of the DARPA-funded GALE program. This is the second
of a two-part release. GALE Phase 1 Arabic Newsgroup Parallel Text -
Part 1 was released in early 2009.
Preparing the source data involved four stages of work: data scouting,
data harvesting, formatting, and data selection.
Data scouting involved manually searching the web for suitable
newsgroup text. Data scouts were assigned particular topics and genres
along with a production target in order to focus their web search.
Formal annotation guidelines and a customized annotation toolkit
helped data scouts to manage the search process and to track progress.
Data scouts logged their decisions about potential text of interest to
a database. A nightly process queried the annotation database and
harvested all designated URLs. Whenever possible, the entire site was
downloaded, not just the individual thread or post located by the data
scout. Once the text was downloaded, its format was standardized so
that the data could be more easily integrated into downstream
annotation processes. Typically, a new script was required for each
new domain name that was identified. After scripts were run, an
optional manual process corrected any remaining formatting problems.
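
As a rough illustration of the harvesting and formatting stages described
above, the following minimal sketch reads designated URLs from a database
and dispatches downloaded text to a per-domain formatting function. The
table name scout_decisions, its columns, the domain forum.example.org, and
the whitespace cleanup are all assumptions made for illustration; the
announcement does not describe LDC's actual schema or scripts.

```python
import sqlite3
from urllib.parse import urlparse


def designated_urls(conn):
    """Return URLs that data scouts marked for harvesting (hypothetical schema)."""
    rows = conn.execute(
        "SELECT url FROM scout_decisions WHERE harvest = 1 ORDER BY url"
    )
    return [row[0] for row in rows]


def site_root(url):
    """Root of the site, since the whole site is harvested when possible."""
    parts = urlparse(url)
    return f"{parts.scheme}://{parts.netloc}/"


# One formatter per domain name, mirroring "a new script for each new domain".
FORMATTERS = {}


def formatter_for(domain):
    """Decorator that registers a formatting function for one domain."""
    def register(func):
        FORMATTERS[domain] = func
        return func
    return register


@formatter_for("forum.example.org")
def format_example_forum(raw):
    # Hypothetical cleanup: collapse runs of whitespace into single spaces.
    return " ".join(raw.split())


def standardize(url, raw):
    """Apply the formatter registered for the URL's domain."""
    domain = urlparse(url).netloc
    if domain not in FORMATTERS:
        raise KeyError(f"no formatting script yet for {domain}")
    return FORMATTERS[domain](raw)


def demo_db():
    """Build an in-memory database with a few made-up scout decisions."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE scout_decisions (url TEXT, harvest INTEGER)")
    conn.executemany(
        "INSERT INTO scout_decisions VALUES (?, ?)",
        [
            ("http://forum.example.org/thread/12", 1),
            ("http://other.example.net/post/7", 0),
            ("http://forum.example.org/thread/3", 1),
        ],
    )
    return conn
```

A domain without a registered formatter raises an error here, which is one
way to flag that a new script still needs to be written.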
The selected documents were then reviewed for content-suitability
using a semi-automatic process. A statistical approach was used to
rank a document's relevance to a set of already-selected documents
labeled as "good." An annotator then reviewed the list of relevance-
ranked documents and selected those which were suitable for a
particular annotation task or for annotation in general. These newly-
judged documents in turn provided additional input for the generation
of new ranked lists.
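
The announcement does not say which statistical approach was used. As one
possible sketch of the select-and-rerank loop described above, the code
below scores each candidate by cosine similarity between its bag-of-words
counts and the summed counts of the documents already labeled "good." The
function names, the whitespace tokenization, and the accept callback that
stands in for the annotator are assumptions for illustration only.

```python
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words term counts using simple whitespace tokenization."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def rank_candidates(good_docs, candidates):
    """Rank candidates by similarity to the pooled 'good' documents."""
    pooled = Counter()
    for doc in good_docs:
        pooled.update(vectorize(doc))
    scored = [(cosine(vectorize(c), pooled), c) for c in candidates]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


def review_round(good_docs, candidates, accept):
    """One pass: rank, let the annotator (accept) judge, and grow the good set."""
    ranked = rank_candidates(good_docs, candidates)
    accepted = [doc for _, doc in ranked if accept(doc)]
    remaining = [doc for _, doc in ranked if doc not in accepted]
    return good_docs + accepted, remaining
```

Calling review_round repeatedly feeds each round's accepted documents back
into the next ranking, which corresponds to the feedback step in the text.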
Manual sentence unit/segment (SU) annotation was also performed as
part of the transcription task. Three types of end-of-sentence SU were
identified: statement SU, question SU, and incomplete SU. After
transcription and SU annotation, files were reformatted into a human-
readable translation format and assigned to professional translators
for careful translation. Translators followed LDC's GALE Translation
guidelines, which describe the makeup of the translation team, the
source data format, the translation data format, best practices for
translating certain linguistic features, and quality control procedures
applied to completed translations.
--------------------------------------------------------------------------
End of Arabic-L: 01 Jun 2009