28.1361, FYI: News from LDC

Mon Mar 20 20:23:50 UTC 2017

LINGUIST List: Vol-28-1361. Mon Mar 20 2017. ISSN: 1069 - 4875.

Subject: 28.1361, FYI: News from LDC

Moderators: linguist at linguistlist.org (Damir Cavar, Malgorzata E. Cavar)
Reviews: reviews at linguistlist.org (Helen Aristar-Dry, Robert Coté,
                                   Michael Czerniakowski)
Homepage: http://linguistlist.org

*****************    LINGUIST List Support    *****************
                       Fund Drive 2017
                   25 years of LINGUIST List!
Please support the LL editors and operation with a donation at:
           http://funddrive.linguistlist.org/donate/

Editor for this issue: Yue Chen <yue at linguistlist.org>
================================================================

Date: Mon, 20 Mar 2017 16:23:30
From: Katie Kindle [ldc at ldc.upenn.edu]
Subject: News from LDC

In this newsletter:

- BOLT Chinese Discussion Forum Parallel Training Data
- IARPA Babel Swahili Language Pack IARPA-babel202b-v1.0d
- Noisy TIMIT Speech
- GALE English-Chinese Parallel Aligned Treebank -- Training

New Corpora:

- BOLT Chinese Discussion Forum Parallel Training Data was developed by LDC and
consists of 1,876,799 tokens of Chinese discussion forum data collected for
the DARPA BOLT program along with their corresponding English translations.

The BOLT (Broad Operational Language Translation) program developed machine
translation and information retrieval for less formal genres, focusing
particularly on user-generated content. LDC supported the BOLT program by
collecting informal data sources -- discussion forums, text messaging and chat
-- in Chinese, Egyptian Arabic and English. The collected data was translated
and annotated for various tasks including word alignment, treebanking,
propbanking and co-reference.

The source data in this release consists of discussion forum threads harvested
from the Internet by LDC using a combination of manual and automatic
processes. The full source data collection is released as BOLT Chinese
Discussion Forums (LDC2016T05). Word-aligned and tagged data is released as
BOLT Chinese-English Word Alignment and Tagging - Discussion Forum Training
(LDC2016T19).

BOLT Chinese Discussion Forum Parallel Training Data is distributed via web
download. 

2017 Subscription Members will automatically receive copies of this corpus.
2017 Standard Members may request a copy as part of their 16 free membership
corpora. Non-members may license this data for a fee. 

- IARPA Babel Swahili Language Pack IARPA-babel202b-v1.0d was developed by
Appen for the IARPA (Intelligence Advanced Research Projects Activity) Babel
program. It contains approximately 200 hours of Swahili conversational and
scripted telephone speech collected between 2012 and 2014, along with
corresponding transcripts.

The Babel program focuses on underserved languages and seeks to develop speech
recognition technology that can be rapidly applied to any human language to
support keyword search performance over large amounts of recorded speech.

The Swahili speech in this release represents that spoken in the Nairobi
dialect region of Kenya. The gender distribution among speakers is
approximately equal; speakers' ages range from 16 years to 65 years. Calls
were made using different telephones (e.g., mobile, landline) from a variety
of environments including the street, a home or office, a public place, and
inside a vehicle.

Transcripts are encoded in UTF-8. 

IARPA Babel Swahili Language Pack IARPA-babel202b-v1.0d is distributed via web
download.

2017 Subscription Members will receive copies of this corpus provided they
have submitted a completed copy of the special license agreement. 2017
Standard Members may request a copy as part of their 16 free membership
corpora. Non-members may license this data for a fee. 

- Noisy TIMIT Speech was developed by the Florida Institute of Technology and
contains approximately 322 hours of speech from the TIMIT Acoustic-Phonetic
Continuous Speech Corpus (LDC93S1) modified with different additive noise
levels. Only the audio has been modified; the original arrangement of the
TIMIT corpus is still as described by the TIMIT documentation.

The additive noise types are white, pink, blue, red, violet and babble noise,
with levels varying in 5 dB (decibel) steps, ranging from 5 to 50 dB. The color
noise types were generated artificially using MATLAB. The babble noise was
selected from a random segment of recorded babble speech scaled relative to
the power of the original TIMIT audio signal.
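
As a rough illustration of scaling noise relative to signal power, here is a
minimal Python sketch. It is not the Florida Institute of Technology's MATLAB
procedure. It assumes the dB levels are signal-to-noise ratios, which the
announcement does not state, and the names mix_at_snr, power, LEVELS_DB and
NOISE_TYPES are hypothetical. A sine tone stands in for TIMIT audio.

```python
import math
import random


def power(samples):
    """Mean squared amplitude of a sequence of samples."""
    return sum(s * s for s in samples) / len(samples)


def mix_at_snr(speech, noise, snr_db):
    """Scale noise so that speech power / noise power matches snr_db, then add it.

    Assumption: the announced dB level is treated as a signal-to-noise ratio.
    """
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as speech")
    noise = noise[:len(speech)]
    target_noise_power = power(speech) / (10 ** (snr_db / 10))
    gain = math.sqrt(target_noise_power / power(noise))
    return [s + gain * n for s, n in zip(speech, noise)]


# Levels and noise types named in the announcement: 5 to 50 dB in 5 dB steps.
LEVELS_DB = list(range(5, 55, 5))
NOISE_TYPES = ["white", "pink", "blue", "red", "violet", "babble"]

if __name__ == "__main__":
    rng = random.Random(0)
    # Stand-in signal: one second of a 440 Hz tone at 16 kHz, not real speech.
    speech = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
    white = [rng.gauss(0.0, 1.0) for _ in range(16000)]
    noisy = mix_at_snr(speech, white, 10)
    added = [y - s for y, s in zip(noisy, speech)]
    # Measured ratio of signal power to added-noise power, in dB.
    print(10 * math.log10(power(speech) / power(added)))
```

Under these assumptions, six noise types at ten levels give 60 noise
conditions per utterance. Pink, blue, red and violet noise would need spectral
shaping of the white noise, which this sketch leaves out.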

Noisy TIMIT Speech is distributed via web download. 

2017 Subscription Members will automatically receive copies of this corpus.
2017 Standard Members may request a copy as part of their 16 free membership
corpora. Non-members may license this data for a fee. 

- GALE English-Chinese Parallel Aligned Treebank -- Training was developed by
LDC and contains 196,123 tokens of word-aligned English and Chinese parallel
text with treebank annotations. This material was used as training data in the
DARPA GALE (Global Autonomous Language Exploitation) program.

Parallel aligned treebanks are treebanks annotated with morphological and
syntactic structures aligned at the sentence level and the sub-sentence level.
Such data sets are useful for natural language processing and related fields,
including automatic word alignment system training and evaluation,
transfer-rule extraction, word sense disambiguation, translation lexicon
extraction, and cultural heritage and cross-linguistic studies. With respect
to machine translation system development, parallel aligned treebanks may
improve system performance through enhanced syntactic parsers, better rules
and knowledge about language pairs, and reduced word error rate.

The English source data was translated into Chinese. Chinese and English
treebank annotations were performed independently. The parallel texts were
then word-aligned. The material in this release corresponds to portions of the
treebanked data in OntoNotes 3.0 (LDC2009T24) and OntoNotes 4.0 (LDC2011T03).

This release consists of English source broadcast programming (CNN, NBC/MSNBC)
and web data collected by LDC in 2005 and 2006. 

GALE English-Chinese Parallel Aligned Treebank -- Training is distributed via
web download. 

2017 Subscription Members will automatically receive copies of this corpus.
2017 Standard Members may request a copy as part of their 16 free membership
corpora. Non-members may license this data for a fee. 

Membership Office
Linguistic Data Consortium
University of Pennsylvania
T: +1-215-573-1275
E: ldc at ldc.upenn.edu
M: 3600 Market St. Suite 810, Philadelphia, PA 19104

Linguistic Field(s): Computational Linguistics

------------------------------------------------------------------------------

This year the LINGUIST List hopes to raise $70,000. This money
will go to help keep the List running by supporting all of our 
Student Editors for the coming year.

Don't forget to check out the Fund Drive 2017 site!

http://funddrive.linguistlist.org/

We collect donations via the eLinguistics Foundation, a
registered 501(c) nonprofit organization with the federal tax
number 45-4211155. The donations can be offset against your
federal and sometimes your state tax return (U.S. taxpayers
only). For more information, visit the IRS website or contact
your financial advisor.

Many companies also offer a gift matching program. Contact
your human resources department and send us the necessary form.

Thank you very much for your support of LINGUIST!
