31.3167, FYI: October 2020 Newsletter - Linguistic Data Consortium

Mon Oct 19 01:30:09 UTC 2020

LINGUIST List: Vol-31-3167. Sun Oct 18 2020. ISSN: 1069 - 4875.

Moderator: Malgorzata E. Cavar (linguist at linguistlist.org)
Student Moderator: Jeremy Coburn
Managing Editor: Becca Morris
Team: Helen Aristar-Dry, Everett Green, Sarah Robinson, Lauren Perkins, Nils Hjortnaes, Yiwen Zhang, Joshua Sims
Jobs: jobs at linguistlist.org | Conferences: callconf at linguistlist.org | Pubs: pubs at linguistlist.org

Homepage: http://linguistlist.org

Please support the LL editors and operation with a donation at:
           https://funddrive.linguistlist.org/donate/

Editor for this issue: Everett Green <everett at linguistlist.org>
================================================================

Date: Sun, 18 Oct 2020 21:29:54
From: Membership Coordinator [ldc at ldc.upenn.edu]
Subject: October 2020 Newsletter - Linguistic Data Consortium

In this newsletter: 
Fall 2020 Data Scholarship Recipients
Membership Year 2021 Publication Preview
LDC data and commercial technology development 

New Publications:
Global TIMIT Learner Treebank English
Corpus of Law, Academic, and News
IARPA Babel Mongolian Language Pack IARPA-babel401b-v2.0b

________________________________________
Fall 2020 data scholarship recipients
Congratulations to the recipients of LDC's Fall 2020 data scholarships:

Nicole Dodd: University of California, Davis (USA); MA, Linguistics.

Satwik Dutta: University of Texas at Dallas (USA); PhD, Electrical
Engineering.  

Pedram Hosseini: George Washington University (USA); PhD, Computer Science.

Mariano Maisonnave: Universidad Nacional del Sur (Argentina); PhD, Computer
Science. 

Mark Sullivan: California State University, Los Angeles (USA); Master's,
Applied and Advanced Studies in Education.

For information about the program, visit the Data Scholarships page.

Membership Year 2021 publication preview
The 2021 Membership Year is just around the corner and plans for next year’s
publications are in progress. Among the expected releases are:
- Global TIMIT Mandarin Chinese
- Columbia Games Corpus
- My Science Tutor Children’s Conversational Speech
- The SSNCE Database of Tamil Dysarthric Speech
- Icelandic Parliamentary Speech
- LORELEI, BOLT, and TAC-KBP data
Check your inbox in the coming weeks for more information about membership
renewal. 

LDC data and commercial technology development
For-profit organizations are reminded that an LDC membership is a
prerequisite for obtaining a commercial license to almost all LDC databases.
Non-member organizations, including non-member for-profit organizations,
cannot use LDC data to develop or test products for commercialization, nor can
they use LDC data in any commercial product or for any commercial purpose. LDC
data users should consult corpus-specific license agreements for limitations
on the use of certain corpora. Visit the Licensing page for further
information.

________________________________________
New publications:
(1) Global TIMIT Learner Treebank English was developed by LDC and LAIX Inc.
and consists of approximately 24 hours of L1 and L2 English read speech and
transcripts. It comprises two separate data sets of 50 speakers reading
120 sentences from Treebank-3 (LDC99T42). Of each speaker's 120 sentences, 20
were read by all speakers, 40 were each read by 10 speakers, and 60 were read
by that speaker alone, for a total of 3220 sentence types.
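
The total of 3220 sentence types can be checked with a short calculation. The
sketch below assumes that the 20/40/60 split applies to each speaker's 120
sentences and that the count refers to one set of 50 speakers; the function
and parameter names are illustrative and do not come from the corpus
documentation.

```python
# Sketch of the sentence-type arithmetic. The per-speaker reading of the
# 20/40/60 split is an assumption, not something the newsletter spells out.

def count_sentence_types(speakers=50, read_by_all=20, shared=40,
                         shared_group_size=10, unique=60):
    """Count distinct sentences (types) across one set of speakers."""
    # Sentences read by every speaker are counted once.
    common_types = read_by_all
    # Each shared sentence is read by a group of speakers, so the readings
    # (speakers * shared) collapse by a factor of the group size.
    shared_types = speakers * shared // shared_group_size
    # Sentences read by a single speaker are all distinct.
    unique_types = speakers * unique
    return common_types + shared_types + unique_types


total_types = count_sentence_types()
total_tokens = 50 * 120  # every speaker reads 120 sentences

print(total_types)   # 3220
print(total_tokens)  # 6000
```

Under these assumptions the three groups contribute 20, 200, and 3000 types
respectively, which matches the stated total.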

It is distributed via web download. 

Non-members may license this data for a fee.

*

(2) Corpus of Law, Academic and News consists of 400 Persian documents divided
into three genres: legal, academic, and news. The legal section contains texts
from official publications, including the civil penal code, the criminal penal
code, and the constitution of the Islamic Republic of Iran. The academic
sub-corpus consists of published academic abstracts in various
disciplinary areas, such as Art and Humanities, Social Sciences, and Natural
Sciences. The news sub-corpus was extracted from an archive of ten Iranian
news outlets spanning the period 2010-2020.

It is distributed via web download. 

Non-members may license this data for a fee. 

*

(3) IARPA Babel Mongolian Language Pack IARPA-babel401b-v2.0b was developed by
Appen for the IARPA (Intelligence Advanced Research Projects Activity) Babel
program. It contains approximately 204 hours of Halh Mongolian conversational
and scripted telephone speech collected in 2014 along with corresponding
transcripts. The gender distribution among speakers is approximately equal;
speakers' ages range from 16 years to 61 years. Calls were made using
different telephones (e.g., mobile, landline) from a variety of environments
including the street, a home or office, a public place, and inside a vehicle.

It is distributed via web download. 

Non-members may license this data for a fee.

Linguistic Field(s): Computational Linguistics
