Arabic-L;LING:New LDC resources
Dilworth Parkinson
dilworth_parkinson at BYU.EDU
Thu Mar 20 20:20:11 UTC 2008
------------------------------------------------------------------------
Arabic-L: Thu 20 Mar 2008
Moderator: Dilworth Parkinson <dilworth_parkinson at byu.edu>
[To post messages to the list, send them to arabic-l at byu.edu]
[To unsubscribe, send message from same address you subscribed from to
listserv at byu.edu with first line reading:
unsubscribe arabic-l ]
-------------------------Directory------------------------------------
1) Subject:New LDC resources
-------------------------Messages-----------------------------------
1)
Date: 20 Mar 2008
From:ldc at ldc.upenn.edu
Subject:New LDC resources
GALE Phase 1 Arabic Blog Parallel Text was prepared by the LDC and
consists of 102K words (222 files) of Arabic blog text and its English
translation from thirty-three sources. This release was used as
training data in Phase 1 of the DARPA-funded GALE program.
The task of preparing this corpus involved four stages of work: data
scouting, data harvesting, formatting, and data selection.
Data scouting involved manually searching the web for suitable blog
text. Data scouts were assigned particular topics and genres along
with a production target in order to focus their web search. Formal
annotation guidelines and a customized annotation toolkit helped data
scouts to manage the search process and to track progress.
Data scouts logged their decisions about potential text of interest
(sites, threads and posts) to a database. A nightly process queried
the annotation database and harvested all designated URLs. Whenever
possible, the entire site was downloaded, not just the individual
thread or post located by the data scout.
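
The logging and nightly harvest described above can be pictured with a small sketch. It is only an illustration: the table name `decisions`, its columns, the example URLs and the helper names are all hypothetical, since this announcement does not describe the actual LDC database or harvester. The sketch selects the URLs that scouts designated for harvest and reduces each one to its site root, reflecting the stated preference for downloading whole sites.

```python
import sqlite3
from urllib.parse import urlsplit


def urls_to_harvest(conn):
    """Return the site root of every URL a scout designated for harvest.

    Assumes a hypothetical table: decisions(url TEXT, designated INTEGER).
    """
    rows = conn.execute(
        "SELECT url FROM decisions WHERE designated = 1 ORDER BY url"
    ).fetchall()
    roots = []
    for (url,) in rows:
        parts = urlsplit(url)
        root = f"{parts.scheme}://{parts.netloc}/"
        if root not in roots:  # queue each site only once
            roots.append(root)
    return roots


def demo_connection():
    """Build an in-memory database holding made-up example rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE decisions (url TEXT, designated INTEGER)")
    conn.executemany(
        "INSERT INTO decisions VALUES (?, ?)",
        [
            ("http://blog-a.example/2007/post-1", 1),
            ("http://blog-a.example/2007/post-2", 1),
            ("http://blog-b.example/thread/9", 0),
            ("http://blog-c.example/entry", 1),
        ],
    )
    return conn
```

A nightly job would call `urls_to_harvest` and pass each root to whatever downloader is in use; the download step itself is left out here because the announcement gives no details about it.
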
Once the text was downloaded, its format was standardized so that the
data could be more easily integrated into downstream annotation
processes. Typically a new script was required for each new domain
name that was identified. After scripts were run, an optional manual
process corrected any remaining formatting problems.
The selected documents were then reviewed for content suitability
using a semi-automatic process. A statistical approach was used to
rank a document's relevance to a set of already-selected documents
labeled as "good." An annotator then reviewed the list of relevance-
ranked documents and selected those which were suitable for a
particular annotation task or for annotation in general.
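
The announcement does not name the statistical approach used for ranking. As a stand-in, the sketch below scores each candidate by the cosine similarity between its word counts and the pooled word counts of the documents already labeled "good", then sorts the candidates for an annotator to review. The function names and the whitespace tokenization are assumptions made for illustration, not the LDC method.

```python
import math
from collections import Counter


def term_counts(text):
    """Count whitespace-separated, lowercased tokens (a simplification)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_relevance(candidates, good_docs):
    """Order candidates from most to least similar to the 'good' set."""
    profile = Counter()
    for doc in good_docs:
        profile.update(term_counts(doc))
    scored = [(cosine(term_counts(c), profile), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored]
```

The ranked list only orders the candidates; the final decision about suitability still rests with the annotator, as described above.
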
After files were selected, they were reformatted into a human-readable
translation format, and the files were then assigned to professional
translators for careful translation. Translators followed LDC's GALE
Translation guidelines, which describe the makeup of the translation
team, the source data format, the translation data format, best
practices for translating certain linguistic features (such as names
and speech disfluencies), and quality control procedures applied to
completed translations.
All final data are in Tab Delimited Format (TDF). TDF is compatible
with other transcription formats, such as the Transcriber format and
AG format, and it is easy to process. Each line of a TDF file
corresponds to a speech segment and contains 13 tab-delimited fields. A
source TDF file and its translation are the same except that the
transcript in the source TDF is replaced by its English translation.
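
Because a source TDF line and its translation are described as identical apart from the transcript, the two can be compared field by field. The sketch below is a minimal illustration under that description: it assumes nothing about the names or order of the 13 fields (the announcement does not list them), and the function names are hypothetical. It splits each line on tabs, checks the field count, and reports the positions where the two lines differ.

```python
EXPECTED_FIELDS = 13  # field count stated in the announcement


def split_tdf_line(line):
    """Split one TDF line into its tab-delimited fields."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != EXPECTED_FIELDS:
        raise ValueError(
            f"expected {EXPECTED_FIELDS} fields, got {len(fields)}"
        )
    return fields


def differing_fields(source_line, translation_line):
    """Return the zero-based positions where the two lines differ."""
    src = split_tdf_line(source_line)
    tgt = split_tdf_line(translation_line)
    return [i for i, (s, t) in enumerate(zip(src, tgt)) if s != t]
```

For a well-formed pair, the result should be a single position, the one holding the transcript in the source file and the English translation in the translated file.
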
GALE Phase 1 Arabic Blog Parallel Text is distributed via web download.
2008 Subscription Members will automatically receive two copies of
this corpus on disc. 2008 Standard Members may request a copy as part
of their 16 free membership corpora. Nonmembers may license this data
for US$1500.
--------------------------------------------------------------------------
End of Arabic-L: 20 Mar 2008