27.2260, FYI: News from LDC

Wed May 18 15:15:41 UTC 2016

LINGUIST List: Vol-27-2260. Wed May 18 2016. ISSN: 1069 - 4875.

Subject: 27.2260, FYI: News from LDC

Moderators: linguist at linguistlist.org (Damir Cavar, Malgorzata E. Cavar)
Reviews: reviews at linguistlist.org (Anthony Aristar, Helen Aristar-Dry, Robert Coté, Sara Couture)
Homepage: http://linguistlist.org

*****************    LINGUIST List Support    *****************
                       Fund Drive 2016
                   25 years of LINGUIST List!
Please support the LL editors and operation with a donation at:
           http://funddrive.linguistlist.org/donate/

Editor for this issue: Ashley Parker <ashley at linguistlist.org>
================================================================

Date: Wed, 18 May 2016 11:15:34
From: Katie Kindle [ldc at ldc.upenn.edu]
Subject: News from LDC

In this newsletter:

LDC at LREC 2016

New publications:

- SDP 2014 & 2015: Broad Coverage Semantic Dependency Parsing
- GALE Phase 4 Chinese Broadcast Conversation Speech
- GALE Phase 4 Chinese Broadcast Conversation Transcripts

LDC at LREC 2016
LDC will attend the 10th Language Resources and Evaluation Conference
(LREC2016), hosted by ELRA, the European Language Resources Association. The
conference will be held in Portorož, Slovenia from May 23-28 and features a
broad range of sessions on language resources and human language technologies
research.
Seven LDC staff members will be presenting current work on topics including
trends in HLT research, building language resources for autism spectrum
disorders, data management plans, rapid development of morphological analyzers
for typologically diverse languages, selection criteria for low resource
language programs, multi-language speech collection for NIST LRE, novel
incentives for collecting data and annotation from people, and more.

Following the conference, LDC’s presented papers and posters will be available
on LDC’s Papers Page.

New Corpora

(1) SDP 2014 & 2015: Broad Coverage Semantic Dependency Parsing consists of
data, tools, system results, and publications associated with the 2014 and
2015 tasks on Broad-Coverage Semantic Dependency Parsing (SDP) conducted in
conjunction with the International Workshop on Semantic Evaluation (SemEval)
and was developed by the SDP task organizers.

SemEval is an ongoing series of evaluations of computational semantic analysis
systems intended to explore the nature of meaning in language. It evolved from
the Senseval word sense disambiguation series to include semantic analysis
tasks outside of word sense disambiguation.

This release is based on English, Chinese and Czech data from the following
resources: Treebank-2 LDC95T17, Proposition Bank I LDC2004T14, NomBank v 1.0
LDC2008T23 and CCGbank LDC2005T13 (English); Chinese Treebank (e.g., Chinese
Treebank 8.0 LDC2013T21) (Chinese); and Prague Dependency Treebank (e.g.,
Prague Dependency Treebank 2.0, LDC2006T01) (Czech).

The results are presented as graphs in three target representations:
MRS-Derived Semantic Dependencies (DM), Enju Predicate–Argument Structures
(PAS), and Prague Semantic Dependencies (PSD). As a fourth, additional target
representation, CCGbank was converted to semantic dependency graphs (in the
subdirectory ‘ccd’).

SDP 2014 & 2015: Broad Coverage Semantic Dependency Parsing is distributed via
web download.

2016 Subscription Members will automatically receive two copies of this
corpus. 2016 Standard Members may request a copy as part of their 16 free
membership corpora. Non-members may license this data for a fee.

(2) GALE Phase 4 Chinese Broadcast Conversation Speech was developed by LDC
and comprises approximately 172 hours of Mandarin Chinese broadcast
conversation speech collected in 2008 by LDC and Hong Kong University of
Science and Technology during Phase 4 of the DARPA GALE (Global Autonomous
Language Exploitation) Program.

Corresponding transcripts are released as GALE Phase 4 Chinese Broadcast
Conversation Transcripts (LDC2016T12).

The broadcast conversation recordings in this release feature interviews,
call-in programs and roundtable discussions focusing principally on current
events and are contained in 236 audio files presented in FLAC-compressed
Waveform Audio File format (.flac), 16000 Hz single-channel 16-bit PCM. Each
file was audited by a native Chinese speaker following Audit Procedure
Specification Version 2.0, which is included in this release.
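
As a rough illustration of what the stated audio parameters imply, the
following Python sketch estimates the uncompressed PCM size and the mean
recording length per file. It uses only the figures given above (about 172
hours, 236 files, 16000 Hz, single channel, 16-bit); the function names are
hypothetical, and because the hour count is approximate the results are
estimates, not properties of the release. The FLAC files as distributed will
be smaller than the uncompressed figure.

```python
# Back-of-the-envelope sizing from the figures quoted in the announcement.
# All names here are illustrative; nothing below reads the corpus itself.

SAMPLE_RATE_HZ = 16_000   # 16000 Hz
BYTES_PER_SAMPLE = 2      # 16-bit PCM
CHANNELS = 1              # single-channel
APPROX_HOURS = 172        # "approximately 172 hours"
NUM_FILES = 236           # audio files in the release


def uncompressed_pcm_bytes(hours,
                           sample_rate=SAMPLE_RATE_HZ,
                           bytes_per_sample=BYTES_PER_SAMPLE,
                           channels=CHANNELS):
    """Return the size in bytes of raw PCM audio of the given duration."""
    seconds = hours * 3600
    return int(seconds * sample_rate * bytes_per_sample * channels)


def mean_minutes_per_file(hours, num_files):
    """Return the average recording length in minutes."""
    return hours * 60 / num_files


if __name__ == "__main__":
    size = uncompressed_pcm_bytes(APPROX_HOURS)
    print(f"Uncompressed PCM: about {size / 1e9:.1f} GB")
    print(f"Mean length: about "
          f"{mean_minutes_per_file(APPROX_HOURS, NUM_FILES):.1f} min per file")
```

Under these assumptions the audio would occupy a little under 20 GB before
compression, with recordings averaging roughly three quarters of an hour.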

GALE Phase 4 Chinese Broadcast Conversation Speech is distributed via web
download.

2016 Subscription Members will automatically receive two copies of this
corpus. 2016 Standard Members may request a copy as part of their 16 free
membership corpora. Non-members may license this data for a fee.

(3) GALE Phase 4 Chinese Broadcast Conversation Transcripts was developed by
LDC and contains transcriptions of approximately 172 hours of Chinese
broadcast conversation speech collected in 2008 by LDC and Hong Kong
University of Science and Technology during Phase 4 of the DARPA GALE (Global
Autonomous Language Exploitation) Program.

Corresponding audio data is released as GALE Phase 4 Chinese Broadcast
Conversation Speech (LDC2016S03).

The transcript files are in plain-text, tab-delimited format (TDF) with UTF-8
encoding, and the transcribed data totals 2,259,952 tokens.
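
For readers who want to work with tab-delimited, UTF-8 transcripts of this
kind, here is a minimal Python sketch of a reader. It is not LDC's tooling,
and several details are assumptions not stated above: that the first
non-comment line is a header naming the columns, that comment lines begin
with ";;", and the column names in the sample data, which are hypothetical.
Check the documentation shipped with the corpus for the actual layout.

```python
import csv
import io


def read_tab_delimited(stream, comment_prefix=";;"):
    """Yield one dict per data row, keyed by the header line's column names.

    Assumes (hypothetically) that the first non-comment line is a header
    and that lines starting with `comment_prefix` are comments.
    """
    # QUOTE_NONE keeps quotation marks inside transcripts as literal text.
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    header = None
    for row in reader:
        if not row or row[0].startswith(comment_prefix):
            continue
        if header is None:
            header = row
            continue
        yield dict(zip(header, row))


# Hypothetical sample; the column names are illustrative only.
SAMPLE = (
    "file\tstart\tend\tspeaker\ttranscript\n"
    ";; illustrative comment line\n"
    "show_a\t0.0\t2.5\tspk1\t你好\n"
    "show_a\t2.5\t6.0\tspk2\t大家好\n"
)

if __name__ == "__main__":
    for segment in read_tab_delimited(io.StringIO(SAMPLE)):
        length = float(segment["end"]) - float(segment["start"])
        print(segment["speaker"], f"{length:.1f}s", segment["transcript"])
```

To read a file from disk instead of the in-memory sample, pass a handle
opened with open(path, encoding="utf-8", newline="").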

The files in this corpus were transcribed by LDC staff and/or by transcription
vendors under contract to LDC. Transcribers followed LDC’s quick transcription
guidelines (QTR) and quick rich transcription specification (QRTR). QTR
transcription consists of quick (near-) verbatim, time-aligned transcripts
plus speaker identification with minimal additional mark-up. QRTR adds
structural information such as topic boundaries and manual sentence unit
annotation.

GALE Phase 4 Chinese Broadcast Conversation Transcripts is distributed via web
download.

2016 Subscription Members will automatically receive two copies of this
corpus. 2016 Standard Members may request a copy as part of their 16 free
membership corpora. Non-members may license this data for a fee.

Linguistic Field(s): Computational Linguistics

------------------------------------------------------------------------------


This year the LINGUIST List hopes to raise $79,000. This money 
will go to help keep the List running by supporting all of our 
Student Editors for the coming year.

Don't forget to check out the Fund Drive 2016 site!

http://funddrive.linguistlist.org/

For all information on donating, including information on how to 
donate by check, money order, PayPal or wire transfer, please visit:
http://funddrive.linguistlist.org/donate/

The LINGUIST List is under the umbrella of Indiana University and
as such can receive donations through Indiana University Foundation. We
also collect donations via eLinguistics Foundation, a registered 501(c)
non-profit organization with the federal tax number 45-4211155. Either
way, the donations can be offset against your federal and sometimes your
state tax return (U.S. taxpayers only). For more information visit the
IRS website, or contact your financial advisor.

Many companies also offer a gift matching program, such that
they will match any gift you make to a non-profit organization.
Normally this entails your contacting your human resources department
and sending us a form that the Indiana University Foundation fills in
and returns to your employer. This is generally a simple administrative
procedure that doubles the value of your gift to LINGUIST, without
costing you an extra penny. Please take a moment to check if
your company operates such a program.

Thank you very much for your support of LINGUIST!

----------------------------------------------------------
LINGUIST List: Vol-27-2260	
----------------------------------------------------------