Arabic-L:LING:LDC new Arabic publications

Dilworth Parkinson dilworthparkinson at GMAIL.COM
Thu Sep 19 06:30:43 UTC 2013


------------------------------------------------------------------------
Arabic-L: Thu 19 Sep 2013
Moderator: Dilworth Parkinson <dilworth_parkinson at byu.edu>
[To post messages to the list, send them to arabic-l at byu.edu]
[To unsubscribe, send message from same address you subscribed from to
listserv at byu.edu with first line reading:
           unsubscribe arabic-l                                      ]

-------------------------Directory------------------------------------

1) Subject: LDC new Arabic publications

-------------------------Messages-----------------------------------
1)
Date: 19 Sep 2013
From: Linguistic Data Consortium ldc at ldc.upenn.edu
Subject: LDC new Arabic publications

(1) GALE Phase 2 Arabic Broadcast Conversation Speech Part 2
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2013S07>
was developed by LDC and consists of approximately 128 hours of Arabic
broadcast conversation speech collected in 2007 by LDC as part of the DARPA
GALE (Global Autonomous Language Exploitation) Program. The data was
collected at LDC’s Philadelphia, PA, USA facilities and at three remote
collection sites. The combined local and outsourced broadcast collection
supported GALE at a rate of approximately 300 hours per week of programming
from more than 50 broadcast sources, for a total of over 30,000 hours of
collected broadcast audio over the life of the program.

LDC's local broadcast collection system is highly automated, easily
extensible and robust, and is capable of collecting, processing and evaluating
hundreds of hours of content from several dozen sources per day. The
broadcast material is served to the system by a set of free-to-air (FTA)
satellite receivers, commercial direct satellite systems (DSS) such as
DirecTV, direct broadcast satellite (DBS) receivers, and cable television
(CATV) feeds. The mapping between receivers and recorders is dynamic and
modular; all signal routing is performed under computer control, using a
256x64 A/V matrix switch. Programs are recorded in a high bandwidth A/V
format and are then processed to extract audio, to generate keyframes and
compressed audio/video, to produce time-synchronized closed captions (in
the case of North American English) and to generate automatic speech
recognition (ASR) output.

The broadcast conversation recordings in this release feature interviews,
call-in programs and round table discussions focusing principally on
current events from several sources. This release contains 141 audio files
presented as 16000 Hz, single-channel, 16-bit PCM .wav files. Each file was
audited by a native Arabic speaker following Audit Procedure Specification
Version 2.0, which is included in this release.
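
As a quick sanity check on the audio format stated above, a received file
can be inspected with Python's standard wave module. This is a minimal
sketch, not part of the LDC release: it assumes the files carry standard
RIFF/WAVE headers, and the function names describe_wav and matches_expected
are hypothetical helpers introduced here for illustration.

```python
import wave

# Format stated in the announcement: 16000 Hz, single-channel, 16-bit PCM.
EXPECTED = {"channels": 1, "sample_width_bytes": 2, "sample_rate_hz": 16000}


def describe_wav(path):
    """Return the basic format parameters and duration of a PCM .wav file."""
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),  # 2 bytes = 16 bits
            "sample_rate_hz": rate,
            "duration_seconds": frames / rate,
        }


def matches_expected(info):
    """True if the file has the format described in the announcement."""
    return all(info[key] == value for key, value in EXPECTED.items())
```

Summing duration_seconds over all files in a delivery gives a rough check
against the approximately 128 hours quoted for the release.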

GALE Phase 2 Arabic Broadcast Conversation Speech Part 2 is distributed on
two DVD-ROMs.

2013 Subscription Members will automatically receive two copies of this
data.  2013 Standard Members may request a copy as part of their 16 free
membership corpora.  Non-members may license this data for US$2000.



(2) GALE Phase 2 Arabic Broadcast Conversation Transcripts Part 2
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2013T17>
was developed by LDC and contains transcriptions of approximately 128 hours
of Arabic broadcast conversation speech collected in 2007 by LDC; MediaNet
(Tunis, Tunisia); and MTC (Rabat, Morocco) during Phase 2 of the DARPA GALE
(Global Autonomous Language Exploitation) program. The source broadcast
conversation recordings feature interviews, call-in programs and round
table discussions focusing principally on current events from several
sources.

The transcript files are in plain-text, tab-delimited format (TDF) with
UTF-8 encoding, and the transcribed data totals 763,945 tokens. The
transcripts were created with the LDC-developed transcription tool XTrans
<http://www.ldc.upenn.edu/tools/XTrans/downloads/>, a multi-platform,
multilingual, multi-channel transcription tool that supports manual
transcription and annotation of audio recordings.
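
Because the transcripts are plain UTF-8, tab-delimited text, they can be
read with Python's standard csv module. The sketch below counts
whitespace-separated tokens in one column. The column position, the ";;"
comment prefix and the presence of a header line are assumptions about the
file layout, not details given in this announcement, and count_tokens is a
hypothetical helper; check the documentation shipped with the corpus before
relying on it. Whitespace splitting may also differ from the tokenization
behind the 763,945 figure.

```python
import csv


def count_tokens(path, text_column, comment_prefix=";;", has_header=True):
    """Count whitespace-separated tokens in one column of a tab-delimited file.

    text_column (0-based), comment_prefix and has_header are assumptions
    about the layout, to be checked against the corpus documentation.
    """
    total = 0
    with open(path, encoding="utf-8", newline="") as handle:
        # QUOTE_NONE keeps quotation marks in transcripts as literal text.
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        header_pending = has_header
        for row in reader:
            if not row or row[0].startswith(comment_prefix):
                continue  # skip blank lines and comment lines
            if header_pending:
                header_pending = False
                continue  # skip the column-name line
            if len(row) > text_column:
                total += len(row[text_column].split())
    return total
```

A call such as count_tokens("example.tdf", text_column=7) would then return
the token count for that file, where both the file name and the column
index are placeholders.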

The files in this corpus were transcribed by LDC staff and/or by
transcription vendors under contract to LDC. Transcribers followed LDC’s
quick transcription guidelines (QTR) and quick rich transcription
specification (QRTR), both of which are included in the documentation with
this release. QTR transcription consists of quick (near-)verbatim,
time-aligned transcripts plus speaker identification with minimal
additional mark-up. It does not include sentence unit annotation. QRTR
annotation adds structural information such as topic boundaries and manual
sentence unit annotation to the core components of a quick transcript.

GALE Phase 2 Arabic Broadcast Conversation Transcripts Part 2 is
distributed via web download.

2013 Subscription Members will automatically receive two copies of this
data on disc.  2013 Standard Members may request a copy as part of their 16
free membership corpora.  Non-members may license this data for US$1500.


--------------------------------------------------------------------------
End of Arabic-L: 19 Sep 2013