28.1890, FYI: News from LDC

Fri Apr 21 00:59:40 UTC 2017

LINGUIST List: Vol-28-1890. Thu Apr 20 2017. ISSN: 1069 - 4875.

Subject: 28.1890, FYI: News from LDC

Moderators: linguist at linguistlist.org (Damir Cavar, Malgorzata E. Cavar)
Reviews: reviews at linguistlist.org (Helen Aristar-Dry, Robert Coté,
                                   Michael Czerniakowski)
Homepage: http://linguistlist.org

*****************    LINGUIST List Support    *****************
                       Fund Drive 2017
                   25 years of LINGUIST List!
Please support the LL editors and operation with a donation at:
           http://funddrive.linguistlist.org/donate/

Editor for this issue: Yue Chen <yue at linguistlist.org>
================================================================

Date: Thu, 20 Apr 2017 20:59:32
From: Katie Kindle [ldc at ldc.upenn.edu]
Subject: News from LDC

Announcements:

- LDC celebrates 25 years
- LDC data and commercial technology development

New publications:

- 2010 NIST Speaker Recognition Evaluation Test Set
- BOLT Egyptian Arabic SMS/Chat and Transliteration
- CHiME2 Grid

LDC celebrates 25 years

April 2017 marks the beginning of LDC’s 25th year as the leader in language
resource development and distribution. Founded in 1992, the Consortium has
grown from a data repository to a vibrant data center that creates, shares and
archives language resources. The Catalog continues to grow, boasting over 700
titles in more than 90 languages. With the support of members, licensees,
sponsors and collaborators, LDC has distributed over 120,000 copies of data to
more than 3,500 organizations worldwide. Our heartfelt thanks for your support
as we continue our mission to provide large quantities of diverse data,
research program support and high quality member services.

LDC data and commercial technology development

Any organization wishing to use LDC data to develop or test products for
commercialization, or to use LDC data in any commercial product or for any
commercial purpose, must first license the data as a For-Profit Member. Once
the data is licensed under the For-Profit Membership, the organization retains
perpetual rights to use the data for commercial technology development. LDC
data users should consult corpus-specific license agreements for limitations
on the use of certain corpora. Visit our Licensing page for more information.

New Corpora:

(1) 2010 NIST Speaker Recognition Evaluation Test Set was developed by LDC and
NIST (National Institute of Standards and Technology). It contains 2,255 hours
of American English telephone speech and interview speech recorded over a
microphone channel, all of which was used as test data in the NIST-sponsored
2010 Speaker Recognition Evaluation (SRE).

The telephone speech segments include two-channel excerpts of approximately 10
seconds and 5 minutes. There are also summed-channel excerpts of approximately
5 minutes. The microphone excerpts are 3-15 minutes in duration. As in prior
evaluations, intervals of silence were not removed.

The 2010 evaluation includes not only conversational telephone speech (CTS)
recorded over ordinary telephone channels for the core training and test
conditions, but also CTS and conversational interview speech recorded over a
room microphone channel. Unlike in prior evaluations, some of the
conversational telephone-style speech was collected in a manner designed to
produce particularly high, or particularly low, vocal effort on the part of
the speaker of interest. In addition to evaluation data, this package also
includes answer keys, trial and train files, development data and evaluation
documentation.

2010 NIST Speaker Recognition Evaluation Test Set is distributed via hard
drive.

2017 Subscription Members will receive copies of this corpus. 2017 Standard
Members may request a copy as part of their 16 free membership corpora.
Non-members may license this data for a fee.

(2) BOLT Egyptian Arabic SMS/Chat and Transliteration was developed by LDC and
consists of naturally occurring Short Message Service (SMS) and Chat (CHT)
data collected through data donations and live collection involving native
speakers of Egyptian Arabic. The corpus contains 5,691 conversations totaling
1,029,248 words across 262,026 messages. Messages were natively written in
either Arabic orthography or romanized Arabizi. A total of 1,856 Arabizi
conversations (287,022 words) were transliterated from the original romanized
Arabizi script into standard Arabic orthography and then reviewed, corrected
and normalized by LDC annotators according to "Conventional Orthography for
Dialectal Arabic" (CODA).

The BOLT (Broad Operational Language Translation) program developed machine
translation and information retrieval for less formal genres, focusing
particularly on user-generated content. LDC supported the BOLT program by
collecting informal data sources -- discussion forums, text messaging and chat
-- in Chinese, Egyptian Arabic and English. The collected data was translated
and annotated for various tasks including word alignment, treebanking,
propbanking and co-reference.

BOLT Egyptian Arabic SMS/Chat and Transliteration is distributed via web
download.

2017 Subscription Members will receive copies of this corpus. 2017 Standard
Members may request a copy as part of their 16 free membership corpora.
Non-members may license this data for a fee.

(3) CHiME2 Grid was developed as part of The 2nd CHiME Speech Separation and
Recognition Challenge and contains approximately 120 hours of English speech
from a noisy living room environment. The CHiME Challenges focus on
distant-microphone automatic speech recognition (ASR) in real-world
environments.

CHiME2 Grid reflects the small vocabulary track of the CHiME2 Challenge. The
target utterances were taken from the Grid corpus and consist of simple 6-word
sequences read by 34 speakers. The data is divided into training, development
and test sets.

CHiME2 Grid is distributed via web download.

2017 Subscription Members will receive copies of this corpus. 2017 Standard
Members may request a copy as part of their 16 free membership corpora.
Non-members may license this data for a fee.

Linguistic Field(s): Computational Linguistics

------------------------------------------------------------------------------
