29.4555, FYI: November 2018 Newsletter - LDC

The LINGUIST List linguist at listserv.linguistlist.org
Fri Nov 16 08:54:21 UTC 2018


LINGUIST List: Vol-29-4555. Fri Nov 16 2018. ISSN: 1069 - 4875.

Subject: 29.4555, FYI: November 2018 Newsletter - LDC

Moderator: linguist at linguistlist.org (Malgorzata E. Cavar)
Reviews: reviews at linguistlist.org (Helen Aristar-Dry, Robert Coté)
Homepage: https://linguistlist.org

Please support the LL editors and operation with a donation at:
           https://funddrive.linguistlist.org/donate/

Editor for this issue: Everett Green <everett at linguistlist.org>
================================================================


Date: Fri, 16 Nov 2018 03:53:43
From: Membership Office [ldc at ldc.upenn.edu]
Subject: November 2018 Newsletter - LDC

 
In this newsletter:

Join LDC for Membership Year 2019
Spring 2019 Data Scholarship Program
Commercial use and LDC data

New publications:
AISHELL-1 
Avatar Education Portuguese 
BOLT Egyptian Arabic Treebank - Discussion Forum 
IARPA Babel Telugu Language Pack IARPA-babel303b-v1.0a
____________________________________________________________

Join LDC for Membership Year 2019
Membership Year 2019 (MY2019) is open, and discounts are available for those
who keep their membership current and join early in the year. Through
March 1, 2019, current MY2018 members who renew their LDC membership will
receive a 10% discount on the membership fee. New or returning organizations
will receive a 5% discount through March 1.

In addition to receiving new publications, current LDC members also enjoy the
benefit of licensing older data at reduced costs from our Catalog of over 750
holdings. Current-year for-profit members may use most data for commercial
applications. 

Plans for MY2019 publications are in progress. Among the expected releases
are:
- SRI Speech-Based Collaborative Learning Corpus: speech from over 100 US
middle school students performing collaborative learning tasks, including
audio recordings, orthographic transcriptions, manual annotation of
collaboration, and related documentation
- Chinese Abstract Meaning Representation (AMR): developed by Nanjing Normal
University and Brandeis University, semantic representation of approximately
10,000 Chinese sentences following the basic principles of AMR using web
source data from Chinese Treebank 8.0 (LDC2013T21)
- Multilanguage conversational telephone speech: developed to support language
identification research in related languages (Arabic, East Asian, English,
Mandarin)
- TAC KBP: English entity discovery and linking, nugget detection and event
argument data, Chinese slot-filling data
- CALLFRIEND Second Edition: updated releases with .wav format audio,
simplified directory structure and enhanced documentation and metadata
(English, Egyptian Arabic, Mandarin Chinese-Taiwan)
- HAVIC Med Progress Test data: English web video, metadata, and annotations
for developing multimedia systems
- IARPA Babel Language Packs (telephone speech and transcripts): languages
include Amharic, Guarani, Igbo, and Lithuanian
- BOLT: discussion forums, SMS, word-aligned and tagged data in all languages
(Chinese, Egyptian Arabic, English)

It’s also not too late to join for MY2017 (through December 31, 2018) and
MY2018 (through December 31, 2019). Data sets from those years include 2010
NIST Speaker Recognition Evaluation Test Set, RATS Keyword Spotting and
Language Identification releases, CHiME, Noisy TIMIT Speech, Concretely
Annotated New York Times and English Gigaword, DIRHA English WSJ Audio,
LORELEI Amharic and Somali Language Packs and DEFT Spanish Treebank. For full
descriptions of all LDC data sets, browse our Catalog.  

Visit Join LDC for details on membership, user accounts and payment.

Spring 2019 Data Scholarship Program
Applications are being accepted through January 15, 2019, for the Spring
2019 LDC Data Scholarship program, which provides university students with
no-cost access to LDC data. Consult the LDC Data Scholarship page for more
information about program rules and submission requirements.

Commercial use and LDC data
For-profit organizations are reminded that an LDC membership is a
prerequisite for obtaining a commercial license to almost all LDC databases.
Non-member organizations, including non-member for-profit organizations,
cannot use LDC data to develop or test products for commercialization, nor can
they use LDC data in any commercial product or for any commercial purpose. LDC
data users should consult corpus-specific license agreements for limitations
on the use of certain corpora. Visit the Licensing page for further
information. 

-----------------------------------------

New publications:

(1) AISHELL-1 was developed by Beijing Shell Shell Technology Co., Ltd. It
contains approximately 520 hours of Chinese Mandarin speech from 400 speakers
recorded simultaneously on three different devices with associated
transcripts.

The goal of the collection was to support speech recognition system
development in 11 domains, including smart homes, autonomous driving,
entertainment, finance, and science and technology. Participants read 500
sentences covering the domains; sentences were chosen for their speech and
phonetic characteristics. The speech was recorded in a quiet indoor
environment on a high-fidelity microphone and two mobile phones (Android and
iOS).

Speakers were recruited from different accent areas across China, including
North, South, and Yue-Gui-Min regions. There were 214 female speakers and 186
male speakers. Additional demographic information about the participants is
included in this release.

AISHELL-1 is distributed via hard drive.

2018 Subscription Members will automatically receive copies of this corpus.
2018 Standard Members may request a copy as part of their 16 free membership
corpora. Non-members may license this data for a fee.

*

(2) Avatar Education Portuguese was developed by the University of Pernambuco
and consists of approximately 80 minutes of Brazilian Portuguese microphone
speech with phonetic and orthographic transcriptions. The data was developed
for Avatar Education, an animated virtual assistant designed to enhance
communication and interaction in educational contexts, such as online
learning.

The corpus contains 1,400 speakers (700 male, 700 female) who generated 1,400
utterances from read and spontaneous speech. Utterances were transcribed at
the word level (without time alignments) and at the phoneme level (with time
alignment labels).

Avatar Education Portuguese is distributed via web download. 

2018 Subscription Members will automatically receive copies of this corpus.
2018 Standard Members may request a copy as part of their 16 free membership
corpora. Non-members may license this data for a fee.

*

(3) BOLT Egyptian Arabic Treebank - Discussion Forum was developed by LDC and
consists of Egyptian Arabic web discussion forum data with part-of-speech
annotation, morphology, gloss and syntactic tree annotation collected for the
DARPA Broad Operational Language Translation (BOLT) Program. 

The annotations in this release follow Penn Arabic Treebank (PATB) annotation
guidelines. There are two kinds of morphological analysis synchronized in the
corpus. LDC Standard Arabic Morphological Analyzer (SAMA) Version 3.1
(LDC2010L01) was used for Modern Standard Arabic tokens, and CALIMA (Columbia
Arabic Language and dIalect Morphological Analyzer) was used for Egyptian
Arabic tokens.

This release contains 440,448 tokens before clitics were split and 508,548
tree tokens after clitics were split for treebank annotation. The source
material is web discussion forums collected by LDC from various sources.

The unannotated Egyptian Arabic source data is released as BOLT Arabic
Discussion Forums (LDC2018T10).

BOLT Egyptian Arabic Treebank - Discussion Forum is distributed via web
download. 

2018 Subscription Members will automatically receive copies of this corpus.
2018 Standard Members may request a copy as part of their 16 free membership
corpora. Non-members may license this data for a fee.

*

(4) IARPA Babel Telugu Language Pack IARPA-babel303b-v1.0a was developed by
Appen for the IARPA (Intelligence Advanced Research Projects Activity) Babel
program. It contains approximately 201 hours of Telugu conversational and
scripted telephone speech collected in 2013 and 2014 along with corresponding
transcripts.
 
The Telugu speech in this release represents that spoken in the Central, East,
South, and North Telugu dialect regions of India. The gender distribution
among speakers is approximately equal; speakers' ages range from 16 years to
65 years. Calls were made using different telephones (e.g., mobile, landline)
from a variety of environments including the street, a home or office, a
public place, and inside a vehicle.
 
IARPA Babel Telugu Language Pack IARPA-babel303b-v1.0a is available via web
download.
 
2018 Subscription Members will receive copies of this corpus provided they
have submitted a completed copy of the special license agreement. 2018
Standard Members may request a copy as part of their 16 free membership
corpora. Non-members may license this data for a fee. 

*

Membership Office
Linguistic Data Consortium
University of Pennsylvania
T: +1-215-573-1275
E: ldc at ldc.upenn.edu
M: 3600 Market St. Suite 810
Philadelphia, PA 19104
 



Linguistic Field(s): Computational Linguistics