32.2993, FYI: September 2021 Newsletter - LDC

The LINGUIST List linguist at listserv.linguistlist.org
Wed Sep 22 03:59:45 UTC 2021


LINGUIST List: Vol-32-2993. Tue Sep 21 2021. ISSN: 1069 - 4875.

Moderator: Malgorzata E. Cavar (linguist at linguistlist.org)
Student Moderator: Jeremy Coburn, Lauren Perkins
Managing Editor: Becca Morris
Team: Helen Aristar-Dry, Everett Green, Sarah Robinson, Nils Hjortnaes, Joshua Sims, Billy Dickson
Jobs: jobs at linguistlist.org | Conferences: callconf at linguistlist.org | Pubs: pubs at linguistlist.org

Homepage: http://linguistlist.org

Editor for this issue: Everett Green <everett at linguistlist.org>
================================================================


Date: Tue, 21 Sep 2021 23:48:09
From: Membership Coordinator [ldc at ldc.upenn.edu]
Subject: September 2021 Newsletter - LDC

 
In this newsletter: 
New Publications:
RATS Speaker Identification
Classical Arabic Dictionary
DiscAlign for Penn and RST Discourse Treebanks
________________________________________

New Publications:
(1) RATS Speaker Identification was developed by LDC and comprises
approximately 1,900 hours of Levantine Arabic, Farsi, Dari, Pashto, and Urdu
conversational telephone speech with annotations of speech segments. The audio
was retransmitted over eight channels, yielding approximately 17,000 hours of
audio in total. The corpus was created to provide training and development
sets for the speaker identification task in the DARPA RATS (Robust Automatic
Transcription of Speech) program.

The source audio consists of conversational telephone speech recordings
collected by LDC specifically for the RATS program from Levantine Arabic,
Pashto, Urdu, Farsi, and Dari native speakers. Annotations on the audio files
include start time, end time, speech activity detection (SAD) label, SAD
provenance, speaker ID, speaker ID provenance, language ID, and language ID
provenance. 
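
For readers planning to work with the segment annotations, the eight fields
listed above could be modeled roughly as follows. This is only a sketch: the
tab-separated layout, the field order, the class and function names, and the
sample values are all assumptions made for illustration and are not taken from
the corpus documentation, which should be consulted for the actual file format.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One annotated segment; field names mirror the list in the announcement."""
    start: float                  # start time (assumed to be in seconds)
    end: float                    # end time (assumed to be in seconds)
    sad_label: str                # speech activity detection label
    sad_provenance: str
    speaker_id: str
    speaker_id_provenance: str
    language_id: str
    language_id_provenance: str

    @property
    def duration(self) -> float:
        return self.end - self.start


def parse_segment(line: str) -> Segment:
    """Parse one line, ASSUMING eight tab-separated fields in the order above."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 8:
        raise ValueError(f"expected 8 fields, got {len(fields)}")
    start, end, *rest = fields
    return Segment(float(start), float(end), *rest)


# Invented sample line, for illustration only (not real corpus data).
example = parse_segment("1.5\t4.0\tS\tmanual\tspk001\tmanual\turd\tmanual\n")
print(example.duration, example.language_id)
```

Keeping the provenance fields next to the values they qualify makes it easy to
filter segments by how each label was produced.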

RATS Speaker Identification is distributed via web download.

2021 Subscription Members will automatically receive copies of this corpus.
2021 Standard Members may request a copy as part of their 16 free membership
corpora. Non-members may license this data for a fee.

*

(2) Classical Arabic Dictionary consists of approximately one hundred million
words of Arabic collected from texts dating between 431 and 1104 CE,
principally books and essays, along with word occurrences, source documents,
and related metadata.

The dictionary is presented in three formats: plain text in UTF-8 encoding,
plain text in CP1256 encoding, and as an SQL database file. Source documents
are presented in UTF-8 and CP1256 encodings.
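
Users who receive only the CP1256 files, or who need a single encoding across
a pipeline, can convert between the two with standard tooling. A minimal
Python sketch is below; the function names and file paths are hypothetical,
and it assumes each input file is valid CP1256 (Windows-1256) text.

```python
from pathlib import Path


def cp1256_to_utf8(data: bytes) -> bytes:
    """Re-encode CP1256 (Windows-1256) bytes as UTF-8."""
    return data.decode("cp1256").encode("utf-8")


def transcode_file(src: Path, dst: Path) -> None:
    """Read src as CP1256 bytes and write the UTF-8 equivalent to dst."""
    dst.write_bytes(cp1256_to_utf8(src.read_bytes()))


# Round-trip check on a short Arabic word rather than real corpus data.
word = "كتاب"
assert cp1256_to_utf8(word.encode("cp1256")).decode("utf-8") == word
```

Working on bytes rather than text-mode file handles avoids any newline
translation, so the converted file differs from the original only in encoding.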

Classical Arabic Dictionary is distributed via web download.

2021 Subscription Members will automatically receive copies of this corpus.
2021 Standard Members may request a copy as part of their 16 free membership
corpora. Non-members may license this data for a fee.

*

(3) DiscAlign for Penn and RST Discourse Treebanks was developed by Saarland
University. It consists of alignment information for the discourse annotations
contained in Penn Discourse Treebank Version 2.0 (LDC2008T05) (PDTB 2.0) and
RST Discourse Treebank (LDC2002T07) (RST-DT). PDTB 2.0 and RST-DT annotations
overlap for 385 newspaper articles in sections 6, 11, 13, 19 and 23 of the
Wall Street Journal corpus contained in Treebank-2 (LDC95T7). DiscAlign for
Penn and RST Discourse Treebanks contains approximately 6,700 alignments
between PDTB 2.0 and RST-DT relations. 

DiscAlign for Penn and RST Discourse Treebanks is available at no cost to all
licensees of PDTB 2.0 and RST-DT and appears in their download queues
associated with these corpora as DiscAlign_Penn_RST_DTB_LDC2021T16.zip.

*

Membership Coordinator
Linguistic Data Consortium
University of Pennsylvania
T: +1-215-573-1275
E: ldc at ldc.upenn.edu
M: 3600 Market St. Suite 810
      Philadelphia, PA 19104 
 



Linguistic Field(s): Computational Linguistics