Arabic-L:LING:New LDC Arabic resources
Dilworth Parkinson
dilworth_parkinson at BYU.EDU
Wed Apr 9 14:30:29 UTC 2008
------------------------------------------------------------------------
Arabic-L: Wed 09 Apr 2008
Moderator: Dilworth Parkinson <dilworth_parkinson at byu.edu>
[To post messages to the list, send them to arabic-l at byu.edu]
[To unsubscribe, send message from same address you subscribed from to
listserv at byu.edu with first line reading:
unsubscribe arabic-l ]
-------------------------Directory------------------------------------
1) Subject: New LDC Arabic resources
-------------------------Messages-----------------------------------
1)
Date: 09 Apr 2008
From: ldc at ldc.upenn.edu
Subject: New LDC Arabic resources
(2) GALE Phase 1 Arabic Blog Parallel Text was prepared by the LDC
and consists of 102K words (222 files) of Arabic blog text and its
English translation from thirty-three sources. This release was used
as training data in Phase 1 of the DARPA-funded GALE program.
The task of preparing this corpus involved four stages of work: data
scouting, data harvesting, formatting, and data selection.
Data scouting involved manually searching the web for suitable blog
text. Data scouts were assigned particular topics and genres along
with a production target in order to focus their web search. Formal
annotation guidelines and a customized annotation toolkit helped data
scouts to manage the search process and to track progress.
Data scouts logged their decisions about potential text of interest
(sites, threads and posts) to a database. A nightly process queried
the annotation database and harvested all designated URLs. Whenever
possible, the entire site was downloaded, not just the individual
thread or post located by the data scout.
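The logging-and-harvesting loop described above can be sketched roughly as
follows. This is only an illustration, not LDC's actual tooling: the table
name scouting_log, its columns, and the fetch callable are all hypothetical,
and SQLite stands in for whatever database was really used.

```python
import sqlite3


def create_log(conn):
    # Hypothetical schema: one row per URL that a data scout logged.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS scouting_log ("
        "url TEXT PRIMARY KEY, "
        "designated INTEGER NOT NULL, "
        "harvested INTEGER NOT NULL DEFAULT 0)"
    )


def pending_urls(conn):
    # URLs a scout marked for harvesting that have not been downloaded yet.
    rows = conn.execute(
        "SELECT url FROM scouting_log "
        "WHERE designated = 1 AND harvested = 0 ORDER BY url"
    )
    return [row[0] for row in rows]


def harvest(conn, fetch):
    # fetch is any callable that takes a URL and returns the page content.
    # It is passed in so the nightly job can be tested without network access.
    saved = {}
    for url in pending_urls(conn):
        saved[url] = fetch(url)
        conn.execute(
            "UPDATE scouting_log SET harvested = 1 WHERE url = ?", (url,)
        )
    conn.commit()
    return saved
```

A scheduler such as cron could call harvest() once a night. Downloading the
entire site rather than the single logged post would be the job of the fetch
callable, which is left out of this sketch.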
Once the text was downloaded, its format was standardized so that the
data could be more easily integrated into downstream annotation
processes. Typically a new script was required for each new domain
name that was identified. After scripts were run, an optional manual
process corrected any remaining formatting problems.
The selected documents were then reviewed for content suitability
using a semi-automatic process. A statistical approach was used to
rank a document's relevance to a set of already-selected documents
labeled as "good." An annotator then reviewed the list of relevance-
ranked documents and selected those which were suitable for a
particular annotation task or for annotation in general.
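The announcement does not say which statistical approach was used for the
ranking. As a stand-in, the sketch below scores each candidate by the cosine
similarity between its word counts and the pooled word counts of the "good"
documents. The function names and the whitespace tokenization are assumptions
made for illustration only.

```python
import math
from collections import Counter


def bag(text):
    # Naive whitespace tokenization; real Arabic text would need more care.
    return Counter(text.lower().split())


def cosine(a, b):
    # Cosine similarity between two word-count vectors.
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_relevance(candidates, good_docs):
    # candidates maps a document name to its text; good_docs is a list of
    # texts already labeled "good". Returns names, most relevant first.
    profile = Counter()
    for doc in good_docs:
        profile.update(bag(doc))
    scored = [(cosine(bag(text), profile), name)
              for name, text in candidates.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored]
```

The ranked list is only an aid to the annotator, who still makes the final
decision about which documents are suitable.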
After files were selected, they were reformatted into a human-readable
translation format, and the files were then assigned to professional
translators for careful translation. Translators followed LDC's GALE
Translation guidelines, which describe the makeup of the translation
team, the source data format, the translation data format, best
practices for translating certain linguistic features (such as names
and speech disfluencies), and quality control procedures applied to
completed translations.
All final data are in Tab Delimited Format (TDF). TDF is compatible
with other transcription formats, such as the Transcriber format and
AG format, and it is easy to process. Each line of a TDF file
corresponds to a speech segment and contains 13 tab-delimited fields. A
source TDF file and its translation are the same except that the
transcript in the source TDF is replaced by its English translation.
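The relationship between a source TDF line and its translation can be checked
with a short sketch like the one below. Only the field count of 13 comes from
the description above. The function names are hypothetical, the position of
the transcript field is not assumed, and any header or comment lines a real
TDF file may contain are not handled.

```python
FIELD_COUNT = 13  # number of tab-delimited fields per line, as stated above


def split_tdf_line(line):
    # Split one TDF line into its fields and verify the expected count.
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != FIELD_COUNT:
        raise ValueError(
            "expected %d fields, got %d" % (FIELD_COUNT, len(fields))
        )
    return fields


def differing_fields(source_line, translation_line):
    # Return the 0-based indexes of the fields that differ between a source
    # line and the corresponding line of its translation.
    src = split_tdf_line(source_line)
    tgt = split_tdf_line(translation_line)
    return [i for i, (s, t) in enumerate(zip(src, tgt)) if s != t]
```

For a well-formed pair, differing_fields() should return a single index, the
one holding the transcript in the source file and the English translation in
the other.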
--------------------------------------------------------------------------
End of Arabic-L: 09 Apr 2008