Arabic-L:LING:LDC GALE Phase 2 Arabic Broadcast News Speech and Transcripts
Dilworth Parkinson
dilworthparkinson at GMAIL.COM
Mon Aug 18 16:12:44 UTC 2014
------------------------------------------------------------------------
Arabic-L: Mon 18 Aug 2014
Moderator: Dilworth Parkinson <dilworth_parkinson at byu.edu>
[To post messages to the list, send them to arabic-l at byu.edu]
[To unsubscribe, send message from same address you subscribed from to
listserv at byu.edu with first line reading:
unsubscribe arabic-l ]
-------------------------Directory------------------------------------
1) Subject: LDC GALE Phase 2 Arabic Broadcast News Speech and Transcripts
-------------------------Messages-----------------------------------
1)
Date: 18 Aug 2014
From: Linguistic Data Consortium <ldc at ldc.upenn.edu>
Subject: LDC GALE Phase 2 Arabic Broadcast News Speech and Transcripts
(1) GALE Phase 2 Arabic Broadcast News Speech Part 1
<https://catalog.ldc.upenn.edu/LDC2014S07> was developed by LDC and
comprises approximately 165 hours of Arabic broadcast news speech
collected in 2006 and 2007 by LDC, MediaNet (Tunis, Tunisia) and MTC (Rabat,
Morocco) during Phase 2 of the DARPA GALE (Global Autonomous Language
Exploitation) Program. Corresponding transcripts are released as GALE Phase
2 Arabic Broadcast News Transcripts Part 1 (LDC2014T17
<http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2014T17>).
Broadcast audio for the GALE program was collected at LDC’s Philadelphia,
PA, USA facilities and at three remote collection sites: Hong Kong
University of Science and Technology, Hong Kong (Chinese); MediaNet, Tunis,
Tunisia (Arabic); and MTC, Rabat, Morocco (Arabic). The combined local
and outsourced broadcast collection supported GALE at a rate of
approximately 300 hours per week of programming from more than 50 broadcast
sources for a total of over 30,000 hours of collected broadcast audio over
the life of the program.
The broadcast recordings in this release feature news programs focusing
principally on current events from the following sources: Abu Dhabi TV, a
television station based in Abu Dhabi, United Arab Emirates; Al Alam News
Channel, based in Iran; Alhurra, a U.S. government-funded regional
broadcaster; Aljazeera, a regional broadcaster located in Doha, Qatar;
Dubai TV, a broadcast station in the United Arab Emirates; Al Iraqiyah, an
Iraqi television station; Kuwait TV, a national broadcast station in
Kuwait; Lebanese Broadcasting Corporation, a Lebanese television station;
Nile TV, a broadcast programmer based in Egypt; Saudi TV, a national
television station based in Saudi Arabia; and Syria TV, the national
television station in Syria.
This release contains 200 audio files presented in FLAC
<http://flac.sourceforge.net/>-compressed Waveform Audio File format
(.flac), 16000 Hz single-channel 16-bit PCM. Each file was audited by a
native Arabic speaker following Audit Procedure Specification Version 2.0,
which is included in this release. The broadcast auditing process served
three principal goals: as a check on the operation of the broadcast
collection system equipment by identifying failed, incomplete or faulty
recordings; as an indicator of broadcast schedule changes by identifying
instances when the incorrect program was recorded; and as a guide for data
selection by retaining information about a program’s genre, data type and
topic.
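As a rough sanity check on the audio figures above, the sketch below works out the decoded (uncompressed) PCM size implied by roughly 165 hours of 16000 Hz, single-channel, 16-bit audio, and the average duration per file across 200 files. The constant and function names are illustrative, and because the hour count is stated as approximate, the results are estimates only. The FLAC files on disc will be smaller.

```python
# Figures taken from the announcement; "165 hours" is approximate.
HOURS = 165
N_FILES = 200
SAMPLE_RATE_HZ = 16000
BYTES_PER_SAMPLE = 2   # 16-bit PCM
CHANNELS = 1           # single channel


def uncompressed_bytes(hours, rate_hz, bytes_per_sample, channels):
    """Size in bytes of raw PCM audio of the given duration and format."""
    seconds = hours * 3600
    return seconds * rate_hz * bytes_per_sample * channels


def average_minutes_per_file(hours, n_files):
    """Mean file duration in minutes, assuming the hours are spread over n_files."""
    return hours * 60 / n_files


if __name__ == "__main__":
    total = uncompressed_bytes(HOURS, SAMPLE_RATE_HZ, BYTES_PER_SAMPLE, CHANNELS)
    print(f"decoded PCM size: {total} bytes (~{total / 1e9:.1f} GB)")
    print(f"average file length: {average_minutes_per_file(HOURS, N_FILES)} minutes")
```

Under these figures the decoded audio comes to about 19 GB, with files averaging a little under 50 minutes each.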
GALE Phase 2 Arabic Broadcast News Speech Part 1 is distributed on three
DVD-ROMs.
2014 Subscription Members will automatically receive two copies of this
data. 2014 Standard Members may request a copy as part of their 16 free
membership corpora. Non-members may license this data for US$2000.
(2) GALE Phase 2 Arabic Broadcast News Transcripts Part 1
<https://catalog.ldc.upenn.edu/LDC2014T17> was developed by LDC and
contains transcriptions of approximately 165 hours of Arabic broadcast news
speech collected in 2006 and 2007 by LDC, MediaNet (Tunis, Tunisia) and MTC
(Rabat, Morocco) during Phase 2 of the DARPA GALE (Global Autonomous Language
Exploitation) program. Corresponding audio data is released as GALE Phase 2
Arabic Broadcast News Speech Part 1 (LDC2014S07
<http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2014S07>).
The transcript files are in plain-text, tab-delimited format (TDF) with
UTF-8 encoding, and the transcribed data totals 897,868 tokens. The
transcripts were created with the LDC-developed transcription tool, XTrans
<https://www.ldc.upenn.edu/language-resources/tools/xtrans>, a
multi-platform, multilingual, multi-channel transcription tool that
supports manual transcription and annotation of audio recordings.
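Because the transcripts are plain-text, tab-delimited, UTF-8 files, they can be read with a few lines of Python. The sketch below is a minimal illustration, not LDC's tooling: the position of the transcript column (text_col), the ";;" comment prefix, and the sample rows are all assumptions and should be checked against the documentation included in the release. A header line, if present, would need to be skipped by the caller. Whitespace splitting is also only one possible definition of a token, so the count need not match the published 897,868.

```python
import io


def count_tokens(lines, text_col, comment_prefix=";;"):
    """Count whitespace-separated tokens in one column of tab-delimited lines.

    text_col and comment_prefix are assumptions about the file layout;
    verify both against the corpus documentation before relying on them.
    """
    total = 0
    for line in lines:
        line = line.rstrip("\r\n")
        if not line or line.startswith(comment_prefix):
            continue  # skip blank lines and assumed comment lines
        fields = line.split("\t")
        if text_col < len(fields):
            total += len(fields[text_col].split())
    return total


if __name__ == "__main__":
    # Invented sample rows; the real column layout may differ.
    sample = io.StringIO(
        "file1\t0.0\t2.5\tspeaker1\tمرحبا بكم\n"
        "file1\t2.5\t4.0\tspeaker2\tشكرا\n"
    )
    print(count_tokens(sample, text_col=4))
```

For a real file, pass an open handle instead of the sample, for example open(path, encoding="utf-8").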
The files in this corpus were transcribed by LDC staff and/or by
transcription vendors under contract to LDC. Transcribers followed LDC's
quick transcription guidelines (QTR) and quick rich transcription
specification (QRTR), both of which are included in the documentation with
this release. QTR transcription consists of quick (near-)verbatim,
time-aligned transcripts plus speaker identification with minimal
additional mark-up. It does not include sentence unit annotation. QRTR
annotation adds structural information such as topic boundaries and manual
sentence unit annotation to the core components of a quick transcript.
Files with QTR as part of the filename were developed using QTR
transcription. Files with QRTR in the filename were developed using QRTR
transcription.
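The filename convention above lends itself to a simple check when sorting files by transcription type. The following sketch assumes only that the marker appears somewhere in the filename; matching case-insensitively is an assumption, and the example filenames are hypothetical rather than taken from the corpus.

```python
from pathlib import Path


def transcription_type(filename):
    """Return "QRTR" or "QTR" based on the marker in the filename, else None."""
    name = Path(filename).name.lower()
    if "qrtr" in name:
        return "QRTR"
    if "qtr" in name:
        return "QTR"
    return None


if __name__ == "__main__":
    # Hypothetical filenames, for illustration only.
    for example in ["EXAMPLE_20070101.qrtr.tdf", "EXAMPLE_20070102.qtr.tdf"]:
        print(example, "->", transcription_type(example))
```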
GALE Phase 2 Arabic Broadcast News Transcripts Part 1 is distributed via
web download.
2014 Subscription Members will automatically receive two copies of this
data on disc. 2014 Standard Members may request a copy as part of their 16
free membership corpora. Non-members may license this data for US$1500.
--------------------------------------------------------------------------
End of Arabic-L: 18 Aug 2014