Arabic-L:LING:Arabic News Parallel Text corpus from LDC

Wed Feb 18 19:19:46 UTC 2009

------------------------------------------------------------------------
Arabic-L: Wed 18 Feb 2009
Moderator: Dilworth Parkinson <dilworth_parkinson at byu.edu>
[To post messages to the list, send them to arabic-l at byu.edu]
[To unsubscribe, send message from same address you subscribed from to
listserv at byu.edu with first line reading:
            unsubscribe arabic-l                                      ]

-------------------------Directory------------------------------------

1) Subject:Arabic News Parallel Text corpus from LDC

-------------------------Messages-----------------------------------
1)
Date: 18 Feb 2009
From:ldc at ldc.upenn.edu
Subject:Arabic News Parallel Text corpus from LDC

(2) GALE Phase 1 Arabic Newsgroup Parallel Text - Part 1 was prepared  
by LDC and contains a total of 178,000 words (264 files) of Arabic  
newsgroup text and its translation selected from thirty-five sources.  
Newsgroups consist of posts to electronic bulletin boards, Usenet  
newsgroups, discussion groups and similar forums. This release was  
used as training data in Phase 1 (year 1) of the DARPA-funded GALE  
program.

Preparing the source data involved four stages of work: data scouting,  
data harvesting, formatting and data selection.

Data scouting involved manually searching the web for suitable  
newsgroup text. Data scouts were assigned particular topics and genres  
along with a production target in order to focus their web search.  
Formal annotation guidelines and a customized annotation toolkit  
helped data scouts to manage the search process and to track progress.

Data scouts logged their decisions about potential text of interest to  
a database. A nightly process queried the annotation database and  
harvested all designated URLs. Whenever possible, the entire site was  
downloaded, not just the individual thread or post located by the data  
scout. Once the text was downloaded, its format was standardized so  
that the data could be more easily integrated into downstream  
annotation processes. Typically, a new script was required for each  
new domain name that was identified. After scripts were run, an  
optional manual process corrected any remaining formatting problems.
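
The harvesting and formatting steps above can be sketched roughly as
follows. This is only an illustration, not LDC's actual tooling: the
table name scout_decisions, its columns, the domain forum.example.org
and the formatter functions are all hypothetical, and the download step
itself is left out.

```python
import sqlite3
from urllib.parse import urlparse


def designated_urls(conn):
    """Return URLs that data scouts marked for harvesting.

    Assumes a hypothetical table scout_decisions(url TEXT, harvest INTEGER).
    """
    rows = conn.execute(
        "SELECT url FROM scout_decisions WHERE harvest = 1 ORDER BY url"
    )
    return [url for (url,) in rows]


def format_example_forum(raw):
    # Hypothetical site-specific script: collapse all runs of whitespace,
    # including line breaks, into single spaces.
    return " ".join(raw.split())


def format_default(raw):
    # Fallback for domains that have no dedicated script yet.
    return raw.strip()


# One formatter per domain name, mirroring "a new script for each new domain".
FORMATTERS = {
    "forum.example.org": format_example_forum,
}


def standardize(url, raw):
    """Apply the formatter registered for the URL's domain, else the default."""
    domain = urlparse(url).netloc.lower()
    return FORMATTERS.get(domain, format_default)(raw)
```

Keeping the formatters in a dictionary keyed by domain means that a
newly identified site only needs one new function and one new entry.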

The selected documents were then reviewed for content-suitability  
using a semi-automatic process. A statistical approach was used to  
rank a document's relevance to a set of already-selected documents  
labeled as "good." An annotator then reviewed the list of relevance- 
ranked documents and selected those which were suitable for a  
particular annotation task or for annotation in general. These newly- 
judged documents in turn provided additional input for the generation  
of new ranked lists.
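
The announcement does not say which statistical approach was used. As
one possible illustration only, the sketch below ranks candidate
documents by cosine similarity between their word counts and the
combined word counts of the documents already labeled "good." The
function names are hypothetical, and whitespace tokenization is a
simplification that would be too crude for real Arabic text.

```python
import math
from collections import Counter


def term_counts(text):
    """Bag-of-words counts using naive whitespace tokenization."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def rank_candidates(good_docs, candidates):
    """Order candidates from most to least similar to the "good" set."""
    profile = Counter()
    for doc in good_docs:
        profile.update(term_counts(doc))
    scored = [(cosine(term_counts(c), profile), c) for c in candidates]
    return [c for _, c in sorted(scored, key=lambda pair: pair[0], reverse=True)]
```

Documents that an annotator accepts from the ranked list can simply be
appended to good_docs before the next ranking pass, which matches the
feedback loop described above.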

Manual sentence unit/segment (SU) annotation was also performed as
part of the transcription task. Three types of end-of-sentence SU were
identified: statement SU, question SU, and incomplete SU. After
transcription and SU annotation, files were reformatted into a
human-readable translation format and assigned to professional
translators for careful translation. Translators followed LDC's GALE
Translation guidelines, which describe the makeup of the translation
team, the source data format, the translation data format, best
practices for translating certain linguistic features, and the quality
control procedures applied to completed translations.

All final data are presented in Tab Delimited Format (TDF). TDF is
compatible with other transcription formats, such as the Transcriber
format and the AG format, making it easy to process.
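
Because the data are tab-delimited, they can be read with standard
tools. The sketch below uses Python's csv module. The column names in
the sample are invented for illustration and are not the corpus's
actual header, and real TDF files may contain comment lines or typed
header fields that need extra handling, so the corpus documentation
should be checked first.

```python
import csv
import io


def read_tab_delimited(handle):
    """Yield one dict per data row, keyed by the header row's column names."""
    # QUOTE_NONE keeps quotation marks inside the text as literal characters.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        yield row


# Hypothetical two-line sample: a header row followed by one data row.
sample = io.StringIO("file\tsegment\ttext\nposting_001\t1\tHello world\n")
rows = list(read_tab_delimited(sample))
```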

GALE Phase 1 Arabic Newsgroup Parallel Text - Part 1 is distributed  
via web download.

Linguistic Data Consortium                     Phone: (215) 573-1275
University of Pennsylvania                       Fax: (215) 573-2175
3600 Market St., Suite 810                         ldc at ldc.upenn.edu
  Philadelphia, PA 19104 USA                   http://www.ldc.upenn.edu

--------------------------------------------------------------------------
End of Arabic-L:  18 Feb 2009