28.4326, FYI: October 2017 Newsletter - LDC

The LINGUIST List linguist at listserv.linguistlist.org
Fri Oct 20 15:02:26 UTC 2017


LINGUIST List: Vol-28-4326. Fri Oct 20 2017. ISSN: 1069 - 4875.

Subject: 28.4326, FYI: October 2017 Newsletter - LDC

Moderators: linguist at linguistlist.org (Damir Cavar, Malgorzata E. Cavar)
Reviews: reviews at linguistlist.org (Helen Aristar-Dry, Robert Coté,
                                   Michael Czerniakowski)
Homepage: http://linguistlist.org

Please support the LL editors and operation with a donation at:
           http://funddrive.linguistlist.org/donate/

Editor for this issue: Kenneth Steimel <ken at linguistlist.org>
================================================================


Date: Fri, 20 Oct 2017 11:02:16
From: Brian Taylor [ldc at ldc.upenn.edu]
Subject: October 2017 Newsletter - LDC

 
In this newsletter: 

LDC Awards Fall Data Scholarships

Membership Year 2018 Publication Preview

New Publications:

RATS Keyword Spotting

English Web Treebank Propbank

Ancient Chinese Corpus

MWE-Aware English Dependency Corpus Version 2.0 

LDC Awards Fall Data Scholarships:

LDC is pleased to award fifteen data scholarships to students this fall.
Recipients are from eight countries and a variety of academic disciplines.
Twenty unique data sets are awarded to the students for their work in diverse
applications including machine translation, abstractive text summarization
using recurrent neural networks, speech recognition for multiple languages,
semantic role labeling for social data, text summarization, speaker
recognition for forensic applications, and more. Please look to LDC’s social
media pages for upcoming announcements highlighting each recipient and their
intended research.  Congratulations to all of our recipients!

Membership Year 2018 Publication Preview

The 2018 Membership Year is just around the corner and plans for next year’s
publications are in progress. Among the expected releases are: 

- Multilanguage conversational telephone speech: developed to support language
identification research in related languages (Central Asian, Central European
language groups)
- DIRHA (Distant-speech Interaction for Robust Home Applications): Wall Street
Journal read speech with noise and reverberation, suitable for various
multi-microphone signal processing and distant speech recognition tasks
- TRAD corpora: Chinese-French and Arabic-French parallel text (newswire, web
data)
- IARPA Babel Language Packs (telephone speech and transcripts): languages
include Cebuano, Guarani, Kazakh, Lithuanian, Telugu, Tok Pisin 
- BOLT: discussion forums, SMS, word-aligned, and tagged data in all languages
(Egyptian Arabic, English, Chinese)
- DEFT: Spanish Treebank (newswire, web data)
- RATS Language Identification data set (Dari, Farsi, Levantine Arabic,
Pashto, Urdu; degraded audio signals)
- TAC KBP: comprehensive English source and entity linked data (broadcast,
telephone speech, newswire, web data)
- German children’s handwriting (longitudinal study of weekly writing in
classroom setting with enhanced output for specific spelling patterns)
  
Check your inbox in the coming weeks for more information about membership
renewal.

New Publications:

(1) RATS Keyword Spotting was developed by LDC and comprises
approximately 3,100 hours of Levantine Arabic and Farsi conversational
telephone speech with automatic and manual annotation of speech segments,
transcripts, and keywords generated from transcript content. The corpus was
created to provide training, development, and initial test sets for the
keyword spotting (KWS) task in the DARPA RATS (Robust Automatic Transcription
of Speech) program. 

The source audio consists of conversational telephone speech recordings
collected by LDC: (1) data collected for the RATS program from Levantine
Arabic and Farsi speakers; (2) material from Levantine Arabic QT Training
Data Set 5, Speech (LDC2006S29); and (3) CALLFRIEND Farsi Second Edition
Speech (LDC2014S01). Transcripts of calls were either produced or available
from the source corpora. Potential target keywords were selected from the
transcripts based on word frequencies to fall within a range of target-word
likelihood per hour of speech. The selected words were manually reviewed to
confirm that each was a regular word or multi-word expression of more than
three syllables.
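
As a rough illustration of the frequency-based selection described above, the
sketch below counts word occurrences in transcripts, converts the counts to
occurrences per hour of speech, and keeps the words whose rate falls inside a
target range. The function name, the range bounds, and the toy transcripts are
hypothetical; the newsletter does not state the thresholds or tooling LDC
actually used, and the manual syllable review is not modeled here.

```python
from collections import Counter


def select_candidate_keywords(transcripts, total_hours, min_per_hour, max_per_hour):
    """Return {word: occurrences per hour} for words inside the target range.

    All parameters are illustrative assumptions, not LDC's actual settings.
    """
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    # Count every whitespace-separated token across all transcripts.
    counts = Counter(word for text in transcripts for word in text.split())
    selected = {}
    for word, count in counts.items():
        rate = count / total_hours
        if min_per_hour <= rate <= max_per_hour:
            selected[word] = rate
    return selected


# Toy data (hypothetical): two short "transcripts" covering 2 hours of speech.
toy_transcripts = ["alpha beta alpha gamma", "alpha beta delta"]
candidates = select_candidate_keywords(
    toy_transcripts, total_hours=2.0, min_per_hour=0.75, max_per_hour=1.25
)
print(candidates)
```

In this toy run only "beta" (2 occurrences in 2 hours) falls inside the range;
"alpha" is too frequent and "gamma" and "delta" are too rare.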

RATS Keyword Spotting is distributed via hard drive.

2017 Subscription Members will receive copies of this corpus. 2017 Standard
Members may request a copy as part of their 16 free membership corpora.
Non-members may license this data for a fee.

(2) English Web Treebank Propbank was developed by the University of Colorado
Boulder - CLEAR (Computational Language and Education Research) and provides
predicate-argument structure annotation for English Web Treebank (LDC2012T13).

The goal of Propbank (or proposition bank) annotation is to develop
annotations with information about basic semantic propositions. English Web
Treebank Propbank provides semantic role annotation and predicate sense
disambiguation for roughly 50,000 predicates, corresponding to all verbs, all
adjectives in equational clauses, and all nouns considered to be predicative.
Mark-up is in the "unified" propbank annotation format, which combines
representations in nouns, verbs, and adjectives. The source data consists of
weblogs, newsgroups, email, reviews, and questions-answers. 

English Web Treebank Propbank is distributed via web download. 

2017 Subscription Members will receive copies of this corpus. 2017 Standard
Members may request a copy as part of their 16 free membership corpora.
Non-members may license this data for a fee.

 
(3) Ancient Chinese Corpus was developed at Nanjing Normal University. It
contains word-segmented and part-of-speech tagged text from Zuozhuan, an
ancient Chinese work believed to date from the Warring States Period (475-221
BC). This release is part of a continuing project to develop a large,
part-of-speech tagged ancient Chinese corpus. It consists of 180,000 Chinese
characters and 195,000 segment units (including words and punctuation). The
part-of-speech tag set was developed by Nanjing Normal University and contains
17 tags. The data is presented as UTF-8 plain text files using traditional
Chinese script.

Ancient Chinese Corpus is distributed via web download.

2017 Subscription Members will receive copies of this corpus. 2017 Standard
Members may request a copy as part of their 16 free membership corpora.
Non-members may license this data for a fee.

(4) MWE-Aware English Dependency Corpus Version 2.0 was developed by the Nara
Institute of Science and Technology Computational Linguistics Laboratory and
consists of English compound function words annotated in dependency format.
The data is derived from OntoNotes Release 5.0 (LDC2013T19). 

Version 2.0 adds annotations of named entities (persons, locations,
organizations) into dependency trees that are aware of compound function
words. Version 1.0 is available from LDC as MWE-Aware English Dependency
Corpus (LDC2017T01). 

MWEs (multiword expressions) were identified in OntoNotes' phrase structure
trees and each MWE was established as a single subtree. Those phrase structure
subtrees were then converted to a dependency structure (the Stanford
dependencies) in CoNLL format. The data is split into 1,728 phrase structure
trees as *.parse files and a single 14-column, tab-separated dependency file
(*.conll). Both file types are encoded as UTF-8.
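
For readers who want to inspect the dependency file, the following is a
minimal sketch of reading a 14-column, tab-separated file into sentences. It
assumes the common CoNLL convention that sentences are separated by blank
lines; the newsletter does not describe the meaning of the individual columns,
so the sample rows use placeholder values and the function name is
hypothetical.

```python
import io

EXPECTED_COLUMNS = 14  # column count stated in the corpus description


def read_conll_sentences(lines, expected_columns=EXPECTED_COLUMNS):
    """Group tab-separated token rows into sentences.

    Assumes blank lines separate sentences (a common CoNLL convention,
    not confirmed by the newsletter).
    """
    sentences, current = [], []
    for line_number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if not line.strip():
            # A blank line closes the sentence collected so far.
            if current:
                sentences.append(current)
                current = []
            continue
        fields = line.split("\t")
        if len(fields) != expected_columns:
            raise ValueError(
                f"line {line_number}: expected {expected_columns} columns, "
                f"found {len(fields)}"
            )
        current.append(fields)
    if current:
        sentences.append(current)
    return sentences


def make_row(token_id, form):
    # Placeholder row: only the first two fields carry sample values.
    return "\t".join([str(token_id), form] + ["_"] * (EXPECTED_COLUMNS - 2))


sample = io.StringIO(
    make_row(1, "in") + "\n" + make_row(2, "spite") + "\n\n" + make_row(1, "of") + "\n"
)
sentences = read_conll_sentences(sample)
print(len(sentences), [len(s) for s in sentences])
```

With a real file, the same function can be given a file object opened with
encoding="utf-8", matching the encoding stated above.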

MWE-Aware English Dependency Corpus Version 2.0 is distributed via web
download.

2017 Subscription Members will receive copies of this corpus. 2017 Standard
Members may request a copy as part of their 16 free membership corpora.
Non-members may license this data for a fee.

Membership Office
Linguistic Data Consortium
University of Pennsylvania
T: +1-215-573-1275
E: ldc at ldc.upenn.edu
M: 3600 Market St. Suite 810
      Philadelphia, PA 19104
 



Linguistic Field(s): Computational Linguistics





 


