Arabic-L:LING:New LDC Arabic resources

Dilworth Parkinson dilworth_parkinson at BYU.EDU
Wed Apr 9 14:30:29 UTC 2008


------------------------------------------------------------------------
Arabic-L: Wed 09 Apr 2008
Moderator: Dilworth Parkinson <dilworth_parkinson at byu.edu>
[To post messages to the list, send them to arabic-l at byu.edu]
[To unsubscribe, send message from same address you subscribed from to
listserv at byu.edu with first line reading:
            unsubscribe arabic-l                                      ]

-------------------------Directory------------------------------------

1) Subject:New LDC Arabic resources

-------------------------Messages-----------------------------------
1)
Date: 09 Apr 2008
From:ldc at ldc.upenn.edu
Subject:New LDC Arabic resources

(2)  GALE Phase 1 Arabic Blog Parallel Text was prepared by the LDC  
and consists of 102K words (222 files) of Arabic blog text and its  
English translation from thirty-three sources. This release was used  
as training data in Phase 1 of the DARPA-funded GALE program.

The task of preparing this corpus involved four stages of work: data  
scouting, data harvesting, formatting, and data selection.

Data scouting involved manually searching the web for suitable blog  
text. Data scouts were assigned particular topics and genres along  
with a production target in order to focus their web search. Formal  
annotation guidelines and a customized annotation toolkit helped data  
scouts to manage the search process and to track progress.

Data scouts logged their decisions about potential text of interest  
(sites, threads and posts) to a database. A nightly process queried  
the annotation database and harvested all designated URLs. Whenever  
possible, the entire site was downloaded, not just the individual  
thread or post located by the data scout.
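
As a rough illustration of the nightly harvesting step, the sketch below
queries a database of scout decisions and queues the whole site for each
designated URL. The table name, column names and example URLs are
hypothetical; the announcement does not describe LDC's actual schema or
tooling, and no download is performed here.

```python
import sqlite3
from urllib.parse import urlsplit


def site_root(url):
    """Return scheme://host/ so the entire site can be queued, not just one post."""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}/"


def urls_to_harvest(conn):
    """Collect distinct site roots for every URL a scout marked for harvesting."""
    rows = conn.execute(
        "SELECT url FROM scout_decisions WHERE harvest = 1 ORDER BY url"
    ).fetchall()
    roots = []
    for (url,) in rows:
        root = site_root(url)
        if root not in roots:
            roots.append(root)
    return roots


# Hypothetical in-memory database standing in for the annotation database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scout_decisions (url TEXT, harvest INTEGER)")
conn.executemany(
    "INSERT INTO scout_decisions VALUES (?, ?)",
    [
        ("http://blog.example.org/2006/05/post-12.html", 1),
        ("http://blog.example.org/2006/06/post-20.html", 1),
        ("http://other.example.net/thread/7", 0),  # rejected by the scout
    ],
)
harvest_queue = urls_to_harvest(conn)
```

Two posts from the same blog collapse into a single site-level entry, which
mirrors the "entire site" behaviour described above.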

Once the text was downloaded, its format was standardized so that the  
data could be more easily integrated into downstream annotation  
processes. Typically a new script was required for each new domain  
name that was identified. After scripts were run, an optional manual  
process corrected any remaining formatting problems.
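
One way to picture "a new script for each new domain name" is a lookup
table that maps a host name to its own formatting function, with a generic
fallback. This is only a sketch under that assumption: the function names,
the host name and the HTML layout are invented for illustration and are not
taken from LDC's scripts.

```python
import re
from urllib.parse import urlsplit


def strip_tags(raw):
    """Generic fallback: drop markup and collapse whitespace."""
    text = re.sub(r"<[^>]+>", " ", raw)
    return re.sub(r"\s+", " ", text).strip()


def format_example_blog(raw):
    """Hypothetical site-specific formatter: keep only the post body."""
    match = re.search(r'<div class="post">(.*?)</div>', raw, re.S)
    return strip_tags(match.group(1) if match else raw)


# One entry per domain name; new domains get a new function and a new entry.
FORMATTERS = {"blog.example.org": format_example_blog}


def standardize(url, raw):
    """Pick the formatter registered for the URL's host, else the fallback."""
    host = urlsplit(url).netloc
    return FORMATTERS.get(host, strip_tags)(raw)
```

Pages from an unregistered domain still pass through the fallback, which is
where a manual correction pass of the kind mentioned above would pick up
any remaining problems.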

The selected documents were then reviewed for content suitability  
using a semi-automatic process. A statistical approach was used to  
rank a document's relevance to a set of already-selected documents  
labeled as "good." An annotator then reviewed the list of relevance- 
ranked documents and selected those which were suitable for a  
particular annotation task or for annotation in general.
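
The announcement does not say which statistical method was used for the
ranking. As a minimal sketch of the general idea, the example below scores
each candidate by the cosine similarity between its bag of words and the
pooled vocabulary of the documents already labeled "good". The method, the
function names and the sample texts are all assumptions made for
illustration.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Count whitespace-separated, lower-cased tokens."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_by_relevance(candidates, good_docs):
    """Return candidate names, most similar to the 'good' set first."""
    profile = Counter()
    for doc in good_docs:
        profile.update(bag_of_words(doc))
    scored = [(cosine(bag_of_words(text), profile), name)
              for name, text in candidates.items()]
    # Sort by descending score, breaking ties by name for a stable order.
    return [name for score, name in sorted(scored, key=lambda p: (-p[0], p[1]))]


good = ["election results in the capital", "the election campaign"]
candidates = {
    "a": "recipe for lentil soup",
    "b": "election campaign news",
    "c": "the weather",
}
ranking = rank_by_relevance(candidates, good)
```

The ranked list is only a reading order for the annotator, who still makes
the final selection.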

After files were selected, they were reformatted into a human-readable  
translation format, and the files were then assigned to professional  
translators for careful translation. Translators followed LDC's GALE  
Translation guidelines, which describe the makeup of the translation  
team, the source data format, the translation data format, best  
practices for translating certain linguistic features (such as names  
and speech disfluencies), and quality control procedures applied to  
completed translations.

All final data are in Tab Delimited Format (TDF). TDF is compatible  
with other transcription formats, such as the Transcriber format and  
AG format, and it is easy to process. Each line of a TDF file  
corresponds to a speech segment and contains 13 tab-delimited fields. A  
source TDF file and its translation are the same except that the  
transcript in the source TDF is replaced by its English translation.
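
A short sketch of reading such files: each line is split on tabs and checked
for 13 fields, and a source file is compared with its translation field by
field, ignoring the transcript column. The position of the transcript
column (index 7 here) and the sample field values are assumptions for
illustration; check the corpus documentation for the real layout.

```python
FIELD_COUNT = 13       # stated in the corpus description
TRANSCRIPT_INDEX = 7   # assumed position of the transcript field


def parse_tdf(text):
    """Split TDF text into rows of exactly FIELD_COUNT tab-delimited fields."""
    rows = []
    for line_no, line in enumerate(text.split("\n"), start=1):
        line = line.rstrip("\r")
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) != FIELD_COUNT:
            raise ValueError(
                f"line {line_no}: expected {FIELD_COUNT} fields, got {len(fields)}"
            )
        rows.append(fields)
    return rows


def is_translation_pair(source_rows, translation_rows):
    """True if the two files differ only in the transcript column."""
    if len(source_rows) != len(translation_rows):
        return False
    for src, tgt in zip(source_rows, translation_rows):
        for i in range(FIELD_COUNT):
            if i != TRANSCRIPT_INDEX and src[i] != tgt[i]:
                return False
    return True


# Invented sample rows; only the transcript field differs between the two.
common = ["doc1", "0", "0.0", "0.0", "author", "unknown", "native"]
tail = ["0", "0", "0", "report", "statement"]
source_text = "\t".join(common + ["مرحبا بالعالم"] + tail) + "\n"
translation_text = "\t".join(common + ["Hello, world"] + tail) + "\n"

source_rows = parse_tdf(source_text)
translation_rows = parse_tdf(translation_text)
```

Keeping the transcript index as a named constant makes it easy to adjust
once the actual field order has been confirmed.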

--------------------------------------------------------------------------
End of Arabic-L:  09 Apr 2008


