32.2084, FYI: June 2021 Newsletter - LDC

Thu Jun 17 05:06:49 UTC 2021

LINGUIST List: Vol-32-2084. Thu Jun 17 2021. ISSN: 1069 - 4875.

Moderator: Malgorzata E. Cavar (linguist at linguistlist.org)
Student Moderator: Jeremy Coburn, Lauren Perkins
Managing Editor: Becca Morris
Team: Helen Aristar-Dry, Everett Green, Sarah Robinson, Nils Hjortnaes, Joshua Sims, Billy Dickson
Jobs: jobs at linguistlist.org | Conferences: callconf at linguistlist.org | Pubs: pubs at linguistlist.org

Homepage: http://linguistlist.org

Editor for this issue: Everett Green <everett at linguistlist.org>
================================================================

Date: Thu, 17 Jun 2021 01:05:12
From: Membership Coordinator [ldc at ldc.upenn.edu]
Subject: June 2021 Newsletter - LDC

In this newsletter: 
LDC data and commercial technology development 

New publications:
MyST Children’s Conversational Speech
BOLT Egyptian Arabic Treebank – Conversational Telephone Speech
________________________________________
LDC data and commercial technology development
For-profit organizations are reminded that an LDC membership is a
prerequisite for obtaining a commercial license to almost all LDC databases.
Non-member organizations, including non-member for-profit organizations,
cannot use LDC data to develop or test products for commercialization, nor can
they use LDC data in any commercial product or for any commercial purpose. LDC
data users should consult corpus-specific license agreements for limitations
on the use of certain corpora. Visit the Licensing page for further
information.
________________________________________

New publications:
(1) MyST Children’s Conversational Speech was developed by Boulder Learning
Inc. It contains 470 hours of English speech from 1,371 students in grades 3-5
conversing with a virtual science tutor in eight areas of science instruction,
along with transcripts and a pronunciation dictionary. Data was collected in
two phases between 2008 and 2017. Spoken dialogs with the virtual tutor were
aligned to classroom instruction using the Full Option Science System, a
research-based science curriculum for grades K-8. Students conversed with the
virtual science tutor for 15-20 minutes. The tutor asked open-ended questions
about media presented on-screen, and students produced spoken answers. 

Data was collected in 10,496 sessions for a total of 227,567 utterances.
Approximately 45% of those utterances (102,433) were transcribed. Data is
divided into development, test, and training partitions for use with ASR
systems.
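
For readers who want to sanity-check the corpus figures quoted above, the
following is a minimal sketch in Python. It uses only the counts stated in
this announcement; the variable and function names are illustrative
assumptions and are not part of the corpus or of any LDC tooling.

```python
# Counts as stated in the announcement (not read from the corpus itself).
STUDENTS = 1_371
SESSIONS = 10_496
UTTERANCES = 227_567
TRANSCRIBED = 102_433


def transcribed_share(transcribed: int, total: int) -> float:
    """Return the fraction of utterances that were transcribed."""
    return transcribed / total


def per_unit_average(count: int, units: int) -> float:
    """Return the mean number of items per unit (e.g. utterances per session)."""
    return count / units


if __name__ == "__main__":
    share = transcribed_share(TRANSCRIBED, UTTERANCES)
    print(f"Transcribed share: {share:.1%}")
    print(f"Utterances per session: {per_unit_average(UTTERANCES, SESSIONS):.1f}")
    print(f"Sessions per student: {per_unit_average(SESSIONS, STUDENTS):.1f}")
```

These are averages derived from the published totals only; the actual
distribution across sessions and students may be uneven.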

MyST Children’s Conversational Speech is distributed via web download.

2021 Subscription Members will automatically receive copies of this corpus
provided they have submitted a completed copy of the special license
agreement. 2021 Standard Members may request a copy as part of their 16 free
membership corpora. Non-members may license this data for a fee.

*

(2) BOLT Egyptian Arabic Treebank – Conversational Telephone Speech was
developed by LDC and consists of Egyptian Arabic conversational telephone
speech data with part-of-speech annotation, morphology, gloss, and syntactic
tree annotation. 

This release contains 153,171 tokens before clitics were split and 182,965
tree tokens after clitics were split for treebank annotation. The source data
was selected from conversational telephone speech collected by LDC for the
CALLHOME project that was transcribed and segmented into sentence units.

Annotations follow Penn Arabic Treebank guidelines, which consist of: (a)
part-of-speech tagging that divides the text into lexical tokens and gives
relevant information about each token such as lexical category, inflectional
features, and a gloss; and (b) Arabic treebanking, which characterizes the
constituent structures of word sequences, provides categories for each
non-terminal node, and identifies null elements, co-reference, traces, and so
on.

The DARPA BOLT (Broad Operational Language Translation) program developed
machine translation and information retrieval for less formal genres, focusing
particularly on user-generated content. LDC supported the BOLT program by
collecting informal data sources -- discussion forums, text messaging, and
chat -- in Chinese, Egyptian Arabic, and English. The collected data was
translated and annotated for various tasks including word alignment,
treebanking, propbanking, and co-reference.

BOLT Egyptian Arabic Treebank – Conversational Telephone Speech is distributed
via web download.

2021 Subscription Members will automatically receive copies of this corpus.
2021 Standard Members may request a copy as part of their 16 free membership
corpora. Non-members may license this data for a fee.

Linguistic Field(s): Computational Linguistics
