Arabic-L:LING:New LDC corpora

Dilworth Parkinson dilworthparkinson at GMAIL.COM
Tue Mar 5 17:42:26 UTC 2013


------------------------------------------------------------------------
Arabic-L: Tue 05 Mar 2013
Moderator: Dilworth Parkinson <dilworth_parkinson at byu.edu>
[To post messages to the list, send them to arabic-l at byu.edu]
[To unsubscribe, send message from same address you subscribed from to
listserv at byu.edu with first line reading:
           unsubscribe arabic-l                                      ]

-------------------------Directory------------------------------------

1) Subject:New LDC corpora

-------------------------Messages-----------------------------------
1)
Date: 05 Mar 2013
From: Linguistic Data Consortium ldc at ldc.upenn.edu
Subject:New LDC corpora

(1) GALE Phase 2 Arabic Broadcast Conversation Speech Part 1 was developed
by LDC and comprises approximately 123 hours of Arabic broadcast
conversation speech collected in 2006 and 2007 by LDC as part of the DARPA
GALE (Global Autonomous Language Exploitation) Program. Broadcast audio for
the DARPA GALE program was collected at LDC's Philadelphia, PA, USA
facilities and at three remote collection sites. The combined local and
outsourced broadcast collection supported GALE at a rate of approximately
300 hours per week of programming from more than 50 broadcast sources for a
total of over 30,000 hours of collected broadcast audio over the life of
the program.
LDC's local broadcast collection system is highly automated, easily
extensible, robust, and capable of collecting, processing and evaluating
hundreds of hours of content from several dozen sources per day. The
broadcast material is served to the system by a set of free-to-air (FTA)
satellite receivers, commercial direct satellite systems (DSS) such as
DirecTV, direct broadcast satellite (DBS) receivers, and cable television
(CATV) feeds. The mapping between receivers and recorders is dynamic and
modular; all signal routing is performed under computer control, using a
256x64 A/V matrix switch. Programs are recorded in a high bandwidth A/V
format and are then processed to extract audio, to generate keyframes and
compressed audio/video, to produce time-synchronized closed captions (in
the case of North American English) and to generate automatic speech
recognition (ASR) output.
The broadcast conversation recordings in this release feature interviews,
call-in programs and round table discussions focusing principally on
current events from several sources. This release contains 143 audio files
in .wav format (16000 Hz, single-channel, 16-bit PCM). Each file was
audited by a native Arabic speaker following Audit Procedure Specification
Version 2.0 which is included in this release. The broadcast auditing
process served three principal goals: as a check on the operation of LDC's
broadcast collection system equipment by identifying failed, incomplete or
faulty recordings; as an indicator of broadcast schedule changes by
identifying instances when the incorrect program was recorded; and as a
guide for data selection by retaining information about a program's genre,
data type and topic.
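
For readers who want to confirm that a downloaded file matches the audio
format stated above, the following is a minimal sketch using Python's
standard wave module. The function name and the constants are illustrative
assumptions, not part of the LDC release or its audit procedure; only the
three format values (16000 Hz, one channel, 16-bit samples) come from the
announcement.

```python
import wave

# Format stated in the announcement: 16000 Hz, single-channel, 16-bit PCM.
EXPECTED_RATE_HZ = 16000
EXPECTED_CHANNELS = 1
EXPECTED_SAMPLE_WIDTH_BYTES = 2  # 16 bits = 2 bytes per sample


def check_wav_format(path):
    """Return (matches_expected_format, duration_in_seconds) for one file.

    Hypothetical helper for illustration; it only inspects the WAV header
    and does not replace the manual audit described in the release.
    """
    with wave.open(path, "rb") as wav:
        matches = (
            wav.getframerate() == EXPECTED_RATE_HZ
            and wav.getnchannels() == EXPECTED_CHANNELS
            and wav.getsampwidth() == EXPECTED_SAMPLE_WIDTH_BYTES
        )
        duration = wav.getnframes() / wav.getframerate()
    return matches, duration
```

Summing the returned durations over all 143 files would give a rough check
against the stated total of approximately 123 hours.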

(2) GALE Phase 2 Arabic Broadcast Conversation Transcripts - Part 1 was
developed by LDC and contains transcriptions of approximately 123 hours of
Arabic broadcast conversation speech collected in 2006 and 2007 by LDC,
MediaNet (Tunis, Tunisia) and MTC (Rabat, Morocco) during Phase 2 of the
DARPA GALE (Global Autonomous Language Exploitation) program. The source
broadcast conversation recordings feature interviews, call-in programs and
round table discussions focusing principally on current events from several
sources.
The transcript files are in plain-text, tab-delimited format (TDF) with
UTF-8 encoding, and the transcribed data totals 752,747 tokens. The
transcripts were created with the LDC-developed transcription tool, XTrans,
a multi-platform, multilingual, multi-channel transcription tool that
supports manual transcription and annotation of audio recordings.
The files in this corpus were transcribed by LDC staff and/or by
transcription vendors under contract to LDC. Transcribers followed LDC's
quick transcription guidelines (QTR) and quick rich transcription
specification (QRTR), both of which are included in the documentation with
this release. QTR transcription consists of quick (near-)verbatim,
time-aligned transcripts plus speaker identification with minimal
additional mark-up. It does not include sentence unit annotation. QRTR
annotation adds structural information such as topic boundaries and manual
sentence unit annotation to the core components of a quick transcript.
Files with QTR in the filename were developed using QTR transcription;
files with QRTR in the filename were developed using QRTR transcription.
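
The filename convention just described can be applied mechanically. The
sketch below is one possible way to sort transcript files by transcription
style; the function name and the example filenames are hypothetical, and
the case-insensitive matching is an assumption, since the announcement
does not show the exact filename pattern.

```python
def transcription_style(filename):
    """Return "QRTR", "QTR", or None based on a marker in the filename.

    Assumption: the marker may appear in any letter case. "QRTR" does not
    contain "QTR" as a substring, so the two checks cannot be confused.
    """
    name = filename.upper()
    if "QRTR" in name:
        return "QRTR"
    if "QTR" in name:
        return "QTR"
    return None


# Hypothetical filenames, for illustration only.
for example in ["program_a.qtr.tdf", "program_b.qrtr.tdf", "notes.txt"]:
    print(example, transcription_style(example))
```

Files for which the function returns None would need to be checked by
hand against the documentation included with the release.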

--------------------------------------------------------------------------
End of Arabic-L: 05 Mar 2013