Arabic-L:LING:New LDC resources
Dilworth Parkinson
dilworth_parkinson at BYU.EDU
Fri Jul 27 19:21:13 UTC 2007
------------------------------------------------------------------------
Arabic-L: Fri 27 Jul 2007
Moderator: Dilworth Parkinson <dilworth_parkinson at byu.edu>
[To post messages to the list, send them to arabic-l at byu.edu]
[To unsubscribe, send message from same address you subscribed from to
listserv at byu.edu with first line reading:
unsubscribe arabic-l ]
-------------------------Directory------------------------------------
1) Subject:New LDC resources
-------------------------Messages-----------------------------------
1)
Date: 27 Jul 2007
From:ldc at ldc.upenn.edu
Subject:New LDC resources
(2) GALE Phase 1 Arabic Broadcast News Parallel Text - Part 1 is the
first part of the three-part GALE Phase 1 Arabic Broadcast News
Parallel Text, which, along with other corpora, was used as training
data in year 1 (Phase 1) of the DARPA-funded GALE program. This
corpus contains transcripts and English translations of 17 hours of
Arabic broadcast news programming selected from a variety of
sources. A manual selection procedure was used to choose data
appropriate for the GALE program, namely, news and conversation
programs focusing on current events. Stories on topics such as
sports, entertainment news, and stock market reports were excluded
from the data set.
The selected audio snippets were then carefully transcribed by LDC
annotators and professional transcription agencies following LDC's
Quick Rich Transcription specification. Manual sentence unit/
segment (SU) annotation was also performed as part of the
transcription task. Three types of end-of-sentence SU are identified:
statement SU
question SU
incomplete SU
After transcription and SU annotation, the files were reformatted
into a human-readable translation format and were then assigned to
professional translators for careful translation. Translators
followed LDC's GALE translation guidelines, which describe the makeup
of the translation team, the source data format, the translation data
format, best practices for translating certain linguistic features
(such as names and speech disfluencies), and quality control
procedures applied to completed translations.
All final data are in Tab Delimited Format (TDF). TDF is compatible
with other transcription formats, such as the Transcriber format and
AG format, and it is easy to process. Each line of a TDF file
corresponds to a speech segment and contains 13 tab-delimited
fields. The source TDF file and its translation are the same except
that the transcript in the source TDF is replaced by its English
translation. GALE Phase 1 Arabic Broadcast News Parallel Text - Part
1 is distributed via web download.
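
As an illustration of the TDF layout described above, here is a minimal
Python sketch that splits each line into its 13 tab-delimited fields and
pairs a source file with its translation. It is not LDC tooling. The
function names are hypothetical, and the position of the transcript
field is not given in this announcement, so it is passed in as the
parameter transcript_col, which you would need to confirm against the
corpus documentation. Any header or comment lines a real file may
contain are not handled here.

```python
from typing import Iterable, List, Tuple

N_FIELDS = 13  # field count stated in the announcement


def parse_tdf(lines: Iterable[str]) -> List[List[str]]:
    """Split each non-empty line into exactly N_FIELDS tab-delimited fields."""
    rows = []
    for lineno, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")  # strip only the line ending, keep empty fields
        if not line:
            continue
        fields = line.split("\t")
        if len(fields) != N_FIELDS:
            raise ValueError(
                f"line {lineno}: expected {N_FIELDS} fields, got {len(fields)}"
            )
        rows.append(fields)
    return rows


def pair_segments(
    source: Iterable[str],
    translation: Iterable[str],
    transcript_col: int,  # assumption: caller supplies the transcript field index
) -> List[Tuple[str, str]]:
    """Return (source transcript, English translation) pairs, segment by segment.

    Checks the property stated in the announcement: the two files are the
    same except for the transcript field.
    """
    src_rows = parse_tdf(source)
    tgt_rows = parse_tdf(translation)
    if len(src_rows) != len(tgt_rows):
        raise ValueError("source and translation have different segment counts")
    pairs = []
    for n, (src, tgt) in enumerate(zip(src_rows, tgt_rows), start=1):
        for i in range(N_FIELDS):
            if i != transcript_col and src[i] != tgt[i]:
                raise ValueError(f"segment {n}: field {i} differs")
        pairs.append((src[transcript_col], tgt[transcript_col]))
    return pairs
```

With two open file objects, a call such as
pair_segments(src_file, tgt_file, transcript_col=k) would return one
(Arabic, English) pair per speech segment, where k is whichever field
index the corpus documentation gives for the transcript.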
2007 Subscription Members will automatically receive two copies of
this corpus. 2007 Standard Members may request a copy as part of
their 16 free membership corpora. Nonmembers may license this data
for US$1500.
------------------------------------------------------------------------
--
End of Arabic-L: 27 Jul 2007