28.3417, FYI: August 2017 Newsletter - LDC

Tue Aug 15 18:42:34 UTC 2017

LINGUIST List: Vol-28-3417. Tue Aug 15 2017. ISSN: 1069 - 4875.

Subject: 28.3417, FYI: August 2017 Newsletter - LDC

Moderators: linguist at linguistlist.org (Damir Cavar, Malgorzata E. Cavar)
Reviews: reviews at linguistlist.org (Helen Aristar-Dry, Robert Coté,
                                   Michael Czerniakowski)
Homepage: http://linguistlist.org

Please support the LL editors and operation with a donation at:
           http://funddrive.linguistlist.org/donate/

Editor for this issue: Yue Chen <yue at linguistlist.org>
================================================================

Date: Tue, 15 Aug 2017 14:42:20
From: Brian Taylor [ldc at ldc.upenn.edu]
Subject: August 2017 Newsletter - LDC

In this newsletter: 

Fall 2017 LDC Data Scholarship program 

LDC at Interspeech 2017

New Publications:

Multi-Language Conversational Telephone Speech 2011 -- South Asian

GALE Phase 4 Arabic Broadcast Conversation Speech

GALE Phase 4 Arabic Broadcast Conversation Transcripts

Fall 2017 LDC Data Scholarship program - September 15 deadline approaching

There is still time to apply to the Fall 2017 LDC Data Scholarship program.
Applications will be accepted through Friday, September 15, 2017. The LDC Data
Scholarship program provides university students with access to LDC data at no
cost. Students must complete an application consisting of a data use proposal
and a letter of support from their advisor.

For more information on application requirements and program rules, please
visit the LDC Data Scholarship page. 

Applicants can email their materials to the LDC Data Scholarship program. 

LDC at Interspeech 2017

LDC will once again be exhibiting at Interspeech, held this year August 20-24
in Stockholm, Sweden. Stop by booth 17 to learn more about recent developments
at the Consortium and new publications.

Also, be on the lookout for the following oral presentation by LDC:

Call My Net Corpus: A Multilingual Corpus for Evaluation of Speaker
Recognition Technology 
Karen Jones, Stephanie Strassel, Kevin Walker, David Graff, Jonathan Wright
Wednesday, August 23, 17:40-18:00 in the Aula Magna room

LDC will post conference updates via our Twitter feed and Facebook page. We
hope to see you there!   

New publications:

(1) Multi-Language Conversational Telephone Speech 2011 -- South Asian was
developed by LDC and is comprised of approximately 118 hours of telephone
speech in five distinct language varieties of South Asia (i.e. the Indian
sub-continent): Bengali, Hindi, Punjabi, Tamil and Urdu. The data were
collected primarily to support research and technology evaluation in automatic
language identification, and portions of these telephone calls were used in
the NIST 2011 Language Recognition Evaluation (LRE). LRE 2011 focused on
language pair discrimination for 24 languages/dialects, some of which could be
considered mutually intelligible or closely related.

Participants were recruited by native speakers who contacted acquaintances in
their social network. Those native speakers made one call, up to 15 minutes,
to each acquaintance. The data were collected using LDC's telephone collection
infrastructure, comprised of three computer telephony systems. Human auditors
labeled calls for callee gender, dialect type, and noise. Demographic
information about the participants was not collected.

LDC has also released the following as part of the Multi-Language
Conversational Telephone Speech 2011 series: Slavic Group (LDC2016S11) and
Turkish (LDC2017S09).

Multi-Language Conversational Telephone Speech 2011 -- South Asian is
distributed via web download.

2017 Subscription Members will receive copies of this corpus. 2017 Standard
Members may request a copy as part of their 16 free membership corpora.
Non-members may license this data for a fee. 

*

(2) GALE Phase 4 Arabic Broadcast Conversation Speech was developed by LDC and
is comprised of approximately 75 hours of Arabic broadcast conversation speech
collected in 2008 and 2009 by LDC; MediaNet (Tunis, Tunisia); and MTC (Rabat,
Morocco) during Phase 4 of the DARPA GALE (Global Autonomous Language
Exploitation) Program.

Corresponding transcripts are released as GALE Phase 4 Arabic Broadcast
Conversation Transcripts (LDC2017T12).

This release contains 83 audio files presented in FLAC-compressed Waveform
Audio File format (.flac), 16000 Hz single-channel 16-bit PCM. Each file was
audited by a native Arabic speaker following Audit Procedure Specification
Version 2.0, which is included in this release.
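
For readers who want to confirm that a downloaded file matches the stated
format (16000 Hz, single-channel, 16-bit), the following is a minimal sketch
in Python. It reads only the STREAMINFO metadata block at the start of a FLAC
stream and does not decode any audio. The function name and the file path in
the usage note are hypothetical and are not part of the LDC release.

```python
def read_flac_streaminfo(data):
    """Return (sample_rate, channels, bits_per_sample) from the first bytes
    of a FLAC stream. Only the leading STREAMINFO block is inspected."""
    if data[:4] != b"fLaC":
        raise ValueError("not a FLAC stream")
    # Byte 4: top bit is the last-block flag, low 7 bits are the block type.
    if data[4] & 0x7F != 0:
        raise ValueError("first metadata block is not STREAMINFO")
    # The block body starts at offset 8. Its first 10 bytes hold the
    # min/max block sizes and min/max frame sizes; the next 8 bytes pack
    # sample rate (20 bits), channels-1 (3 bits), bits-per-sample-1 (5 bits)
    # and total samples (36 bits).
    packed = int.from_bytes(data[18:26], "big")
    sample_rate = packed >> 44
    channels = ((packed >> 41) & 0x7) + 1
    bits_per_sample = ((packed >> 36) & 0x1F) + 1
    return sample_rate, channels, bits_per_sample


def matches_release_format(data):
    """True if the header reports 16000 Hz, one channel, 16 bits per sample."""
    return read_flac_streaminfo(data) == (16000, 1, 16)
```

A possible use, assuming a local file named example.flac (a made-up name):
open the file in binary mode, read the first 42 bytes, and pass them to
matches_release_format.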

The broadcast conversation recordings in this release feature interviews,
call-in programs and roundtable discussions focusing principally on current
events from the following sources: Al Alam News Channel, based in Iran; Al
Fayhaa, an Iraqi television channel; Al Hiwar, a regional broadcast station
based in the United Kingdom; Alhurra, a U.S. government-funded regional
broadcaster; Aljazeera, a regional broadcaster located in Doha, Qatar; Al
Ordiniyah, a national broadcast station in Jordan; Dubai TV, a broadcast
station in the United Arab Emirates; Lebanese Broadcasting Corporation, a
Lebanese television station; Saudi TV, a national television station based in
Saudi Arabia; Syria TV, the national television station in Syria; and Tunisian
National TV, a national television station in Tunisia.

GALE Phase 4 Arabic Broadcast Conversation Speech is distributed via web
download.

2017 Subscription Members will receive copies of this corpus. 2017 Standard
Members may request a copy as part of their 16 free membership corpora.
Non-members may license this data for a fee.

*

(3) GALE Phase 4 Arabic Broadcast Conversation Transcripts was developed by
LDC and contains transcriptions of approximately 75 hours of Arabic broadcast
conversation speech collected in 2008 and 2009 by LDC; MediaNet (Tunis,
Tunisia); and MTC (Rabat, Morocco) during Phase 4 of the DARPA GALE (Global
Autonomous Language Exploitation) program.

Corresponding audio data is released as GALE Phase 4 Arabic Broadcast
Conversation Speech (LDC2017S15).

The transcript files are in plain-text, tab-delimited format (TDF) with UTF-8
encoding, and the transcribed data totals 475,211 tokens. The files in this
corpus were transcribed by LDC staff and/or by transcription vendors under
contract to LDC. Transcribers followed LDC's quick transcription guidelines
(QTR) and quick rich transcription specification (QRTR), both of which are
included in the documentation with this release.
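
As a rough illustration of working with tab-delimited UTF-8 text of this
kind, the sketch below counts whitespace-separated tokens in one column. The
column layout (file, start time, end time, transcript) and the sample rows
are invented for the example; the actual TDF field order is described in the
corpus documentation, and the corpus's own token count may rest on a
different tokenization.

```python
def count_tokens(lines, text_col):
    """Count whitespace-separated tokens in column `text_col` (0-based)
    of an iterable of tab-delimited lines."""
    total = 0
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        # Skip blank or short rows that do not reach the transcript column.
        if len(fields) <= text_col:
            continue
        total += len(fields[text_col].split())
    return total


# Made-up rows in a hypothetical layout: file, start, end, transcript.
sample = [
    "file1\t0.0\t2.5\tmarhaban bikum\n",
    "file1\t2.5\t4.0\tshukran\n",
    "\n",
]
print(count_tokens(sample, text_col=3))  # prints 3
```

To run this over a real file, open it with encoding="utf-8" and pass the
file object as the lines argument, with text_col set to the transcript
column given in the documentation.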

GALE Phase 4 Arabic Broadcast Conversation Transcripts is distributed via web
download.

2017 Subscription Members will receive copies of this corpus. 2017 Standard
Members may request a copy as part of their 16 free membership corpora.
Non-members may license this data for a fee.

Membership Office
Linguistic Data Consortium
University of Pennsylvania
T: +1-215-573-1275
E: ldc at ldc.upenn.edu
M: 3600 Market St. Suite 810
      Philadelphia, PA 19104

Linguistic Field(s): Computational Linguistics

------------------------------------------------------------------------------

----------------------------------------------------------
LINGUIST List: Vol-28-3417	
----------------------------------------------------------