29.2115, FYI: May 2018 Newsletter - LDC

The LINGUIST List linguist at listserv.linguistlist.org
Tue May 15 22:25:35 UTC 2018


LINGUIST List: Vol-29-2115. Tue May 15 2018. ISSN: 1069 - 4875.

Subject: 29.2115, FYI: May 2018 Newsletter - LDC

Moderators: linguist at linguistlist.org (Damir Cavar, Malgorzata E. Cavar)
Reviews: reviews at linguistlist.org (Helen Aristar-Dry, Robert Coté,
                                   Michael Czerniakowski)
Homepage: http://linguistlist.org

Please support the LL editors and operation with a donation at:
           http://funddrive.linguistlist.org/donate/

Editor for this issue: Kenneth Steimel <ken at linguistlist.org>
================================================================


Date: Tue, 15 May 2018 18:24:35
From: LDC Membership Office [ldc at ldc.upenn.edu]
Subject: May 2018 Newsletter - LDC

 
In this newsletter: 

New Publications:

Rhythm and Pitch

GALE Phase 4 Arabic Broadcast News Speech

GALE Phase 4 Arabic Broadcast News Transcripts

New publications:

(1) Rhythm and Pitch contains approximately 27 minutes of spontaneous English
conversations and radio news stories annotated with the Rhythm and Pitch (RaP)
scheme. Speech data for annotation was taken from two corpora released by LDC,
CALLHOME American English Speech (LDC97S42) and Boston University Radio Speech
Corpus (LDC96S36).

The RaP system permits the capture of both intonational and rhythmic aspects
of speech. Four labeling tiers are used for annotating speech prosody. These
tiers carry information about the syllabic organization and orthography of the
speech, its rhythmic structure, tonal patterns, and other information. More
information about the RaP system is available on the RaP homepage.

Speech data are presented as FLAC-compressed 16-bit WAV files. The Boston data
are single-channel 16 kHz files, while the CALLHOME data are one- or
two-channel 8 kHz files. Annotations are UTF-8 encoded Praat TextGrids.

Rhythm and Pitch is distributed via web download.

2018 Subscription Members will receive copies of this corpus. 2018 Standard
Members may request a copy as part of their 16 free membership corpora.
Non-members may license this data for a fee.

(2) GALE Phase 4 Arabic Broadcast News Speech was developed by LDC and
comprises approximately 37 hours of Arabic broadcast news speech collected
in 2008 and 2009 by LDC, MediaNet (Tunis, Tunisia), and MTC (Rabat, Morocco)
during Phase 4 of the DARPA GALE (Global Autonomous Language Exploitation)
Program.

Corresponding transcripts are released as GALE Phase 4 Arabic Broadcast News
Transcripts (LDC2018T14).

The recordings in this release feature news broadcasts focusing principally on
current events from the following sources: Abu Dhabi TV, a television station
based in Abu Dhabi, United Arab Emirates; Al Arabiya, a news television
station based in Dubai; Al Baghdadya, an Iraqi broadcast programmer; Alhurra,
a U.S. government-funded regional broadcaster; Al Iraqiyah, an Iraqi
television station; Aljazeera, a regional broadcaster located in Doha, Qatar;
Al Ordiniyah, a national broadcast station in Jordan; Kuwait TV, a national
broadcast station based in Kuwait; Radio Sawa, a U.S. government-funded
regional broadcaster; Saudi TV, a national television station based in Saudi
Arabia; Syria TV, the national television station in Syria; and Yemen TV, a
television station based in Yemen.

This release contains 51 audio files presented in FLAC-compressed Waveform
Audio File format (.flac), 16000 Hz single-channel 16-bit PCM. Each file was
audited by a native Arabic speaker following Audit Procedure Specification
Version 2.0, which is included in this release.
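
For readers who want to confirm that a downloaded file matches the stated
format (16000 Hz, single-channel, 16-bit), the following is a minimal sketch
that reads the STREAMINFO block at the start of a FLAC stream using only the
Python standard library. The function names and the idea of checking the
first 26 bytes are illustrative assumptions, not part of the LDC release or
its tooling.

```python
def read_flac_streaminfo(data: bytes) -> dict:
    """Parse basic audio properties from the first bytes of a FLAC stream."""
    if len(data) < 26:
        raise ValueError("need at least the first 26 bytes of the file")
    if data[:4] != b"fLaC":
        raise ValueError("not a FLAC stream")
    # Byte 4: high bit is the last-metadata-block flag, low 7 bits the type.
    if data[4] & 0x7F != 0:
        raise ValueError("first metadata block is not STREAMINFO")
    # STREAMINFO body starts at byte 8. After 10 bytes of block and frame
    # sizes come 64 packed bits: 20 sample rate, 3 channels-1,
    # 5 bits-per-sample-1, 36 total samples.
    packed = int.from_bytes(data[18:26], "big")
    return {
        "sample_rate": packed >> 44,
        "channels": ((packed >> 41) & 0x7) + 1,
        "bits_per_sample": ((packed >> 36) & 0x1F) + 1,
        "total_samples": packed & ((1 << 36) - 1),
    }


def matches_stated_format(info: dict) -> bool:
    """Check against the format stated in the announcement (assumed exact)."""
    return (
        info["sample_rate"] == 16000
        and info["channels"] == 1
        and info["bits_per_sample"] == 16
    )
```

In practice one would read the leading bytes of each .flac file in binary
mode (for example, open(path, "rb").read(26)) and pass them to
read_flac_streaminfo; the path is whatever location the files were
downloaded to.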

GALE Phase 4 Arabic Broadcast News Speech is distributed via web download.

2018 Subscription Members will receive copies of this corpus. 2018 Standard
Members may request a copy as part of their 16 free membership corpora.
Non-members may license this data for a fee.

(3) GALE Phase 4 Arabic Broadcast News Transcripts was developed by LDC and
contains transcriptions of approximately 37 hours of Arabic broadcast news
speech collected in 2008 and 2009 by LDC, MediaNet (Tunis, Tunisia), and MTC
(Rabat, Morocco) during Phase 4 of the DARPA GALE (Global Autonomous Language
Exploitation) program.

Corresponding audio data is released as GALE Phase 4 Arabic Broadcast News
Speech (LDC2018S05).

The transcript files are in plain-text, tab-delimited format (TDF) with UTF-8
encoding, and the transcribed data totals 204,735 tokens. The transcripts were
created with the LDC tool XTrans, which supports manual transcription and
annotation of audio recordings. 
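
As a rough illustration of working with tab-delimited, UTF-8 transcript
files, the sketch below counts whitespace-separated tokens in one column of
such a file. The column index of the transcript text, the presence of a
header row, and the ";;" comment prefix are all assumptions made for the
example and should be checked against the documentation shipped with the
corpus. A simple whitespace split is also not guaranteed to reproduce the
204,735-token figure, since the tokenization behind that count is not
described here.

```python
import csv
import io


def count_tokens(tdf_text: str, transcript_col: int = 7,
                 has_header: bool = True) -> int:
    """Count whitespace-separated tokens in one column of tab-delimited text.

    transcript_col, has_header, and the ";;" comment prefix are assumptions
    for this sketch, not values confirmed by the announcement.
    """
    reader = csv.reader(io.StringIO(tdf_text), delimiter="\t",
                        quoting=csv.QUOTE_NONE)
    total = 0
    header_pending = has_header
    for row in reader:
        # Skip blank lines and lines assumed to be comments.
        if not row or row[0].startswith(";;"):
            continue
        # Skip the first remaining row if it is a header.
        if header_pending:
            header_pending = False
            continue
        if len(row) > transcript_col:
            total += len(row[transcript_col].split())
    return total
```

To apply this to a file on disk, read it with encoding="utf-8" and pass the
resulting string to count_tokens, adjusting transcript_col to match the
actual column layout.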

GALE Phase 4 Arabic Broadcast News Transcripts is distributed via web
download.

2018 Subscription Members will receive copies of this corpus. 2018 Standard
Members may request a copy as part of their 16 free membership corpora.
Non-members may license this data for a fee.


Membership Office
Linguistic Data Consortium
University of Pennsylvania
T: +1-215-573-1275
E: ldc at ldc.upenn.edu
M: 3600 Market St. Suite 810
      Philadelphia, PA 19104
 



Linguistic Field(s): Computational Linguistics





 


