30.299, FYI: January 2019 Newsletter - LDC

Fri Jan 18 05:44:32 UTC 2019

LINGUIST List: Vol-30-299. Fri Jan 18 2019. ISSN: 1069 - 4875.

Subject: 30.299, FYI: January 2019 Newsletter - LDC

Moderator: linguist at linguistlist.org (Malgorzata E. Cavar)
Reviews: reviews at linguistlist.org (Helen Aristar-Dry, Robert Coté)
Homepage: https://linguistlist.org

Please support the LL editors and operation with a donation at:
           https://funddrive.linguistlist.org/donate/

Editor for this issue: Everett Green <everett at linguistlist.org>
================================================================

Date: Fri, 18 Jan 2019 00:42:25
From: Membership Coordinator [ldc at ldc.upenn.edu]
Subject: January 2019 Newsletter - LDC

January 2019 Newsletter
In this newsletter:
Renew Your LDC Membership Today

New publications:

BOLT Arabic Discussion Forum Parallel Training Data
SRI Speech-Based Collaborative Learning Corpus
TAC KBP Entity Discovery and Linking - Comprehensive Training and Evaluation
Data 2014-2015

Renew Your LDC Membership Today:

Join LDC while membership savings are still available. Now through March 1,
2019, all organizations receive a discount of up to 10% on the 2019 membership
fee when they join the Consortium or renew their membership. This
year’s planned publications include Multilanguage Conversational Telephone
Speech (telephone speech in languages/dialects considered mutually
intelligible or closely related), IARPA Babel Language Packs (telephone speech
and transcripts in underserved languages), Chinese Abstract Meaning
Representation Corpus, SRI Speech-Based Collaborative Learning Corpus, data
from BOLT, HAVIC, DEFT, TAC KBP and more. Membership remains the most
economical way to access LDC releases. Visit the Join LDC page for details on
membership options and benefits.

New publications:

(1) BOLT Arabic Discussion Forum Parallel Training Data was developed by LDC
and consists of 1,169,599 tokens of Egyptian Arabic discussion forum data
collected for the DARPA BOLT program along with their corresponding English
translations.

LDC supported the BOLT program by collecting informal data sources --
discussion forums, text messaging, and chat -- in Chinese, Egyptian Arabic,
and English. The collected data was translated and annotated for various tasks
including word alignment, treebanking, propbanking, and co-reference.

The source data in this release consists of discussion forum threads harvested
from the Internet by LDC using a combination of manual and automatic
processes. The full source data collection is released as BOLT Arabic
Discussion Forums (LDC2018T10).

Data was manually selected for translation according to several criteria,
including linguistic features and topic features. The files were then
segmented into sentence units, formatted into a human-readable translation
format, and assigned to translation vendors. Translators followed LDC's BOLT
translation guidelines. Bilingual LDC staff performed quality control
procedures on the completed translations.

BOLT Arabic Discussion Forum Parallel Training Data is available as a web
download. 

2019 Subscription Members will automatically receive copies of this corpus.
2019 Standard Members may request a copy as part of their 16 free membership
corpora. Non-members may license this data for a fee.

*

(2) SRI Speech-Based Collaborative Learning Corpus was developed by SRI
International and comprises approximately 120 hours of English speech
from 134 US middle school students working collaboratively. The data set also
contains orthographic transcriptions, manual annotation of collaboration, log
files, and supporting documentation.

This collection was part of a project investigating the utility of a
speech-based learning analytics approach to collaborative learning. The goal
was to determine whether detectable patterns exist in student speech that
correlate with collaborative learning indicators and to provide a means of
assessing collaboration quality. The participants were students in middle
schools (grades six, seven, and eight) located in California. Students worked
in groups of three on sets of short mathematics problems based on the
"cloze" task: each student was assigned one blank, and each problem
required the students to work together and talk to each other to coordinate
their three answers. The problems were presented on iPads with a custom
software application, and the audio data was captured by both head-mounted and
table-top microphones.

SRI Speech-Based Collaborative Learning Corpus is available as a web download.

2019 Subscription Members will receive copies of this corpus provided they
have submitted a completed copy of the special license agreement. 2019
Standard Members may request a copy as part of their 16 free membership
corpora. Non-members may license this data for a fee.

*

(3) TAC KBP Entity Discovery and Linking - Comprehensive Training and
Evaluation Data 2014-2015 was developed by LDC and contains training and
evaluation data produced in support of the TAC KBP Entity Discovery and
Linking (EDL) tasks in 2014 and 2015. It includes queries, knowledge base (KB)
links, equivalence class clusters for NIL entities, and entity type
information for each of the queries. Also included in this data set are all
necessary source documents as well as BaseKB, the second reference KB, which
was adopted for use by EDL in 2015. The first EDL reference KB, to which 2014
EDL data are linked, is available separately as TAC KBP Reference Knowledge
Base (LDC2014T16).

The goal of the EDL track is to conduct end-to-end entity extraction, linking,
and clustering. To produce gold standard data for a given document collection,
annotators (1) extract (identify and classify) entity mentions (queries) and
link them to nodes in a reference KB, and (2) perform cross-document
co-reference on within-document entity clusters that cannot be linked to the KB.
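
As a rough illustration of the structure described above, the following Python
sketch models a mention that is either linked to a KB node or assigned to a
cross-document NIL cluster. It is not the corpus's actual file format: the
Mention class, its field names, the entity type labels, and all identifiers
are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Mention:
    # All field names and values here are hypothetical, not the corpus schema.
    doc_id: str                        # source document identifier
    text: str                          # mention string (the "query")
    entity_type: str                   # illustrative label such as "PER" or "ORG"
    kb_id: Optional[str]               # KB node ID, or None if not in the KB
    nil_cluster: Optional[str] = None  # cross-document cluster ID for NIL entities


def group_by_entity(mentions):
    """Group mentions by KB node, or by NIL cluster when no KB link exists."""
    groups = defaultdict(list)
    for m in mentions:
        key = m.kb_id if m.kb_id is not None else f"NIL:{m.nil_cluster}"
        groups[key].append(m)
    return dict(groups)


mentions = [
    Mention("doc1", "Penn", "ORG", "kb_001"),
    Mention("doc2", "University of Pennsylvania", "ORG", "kb_001"),
    Mention("doc1", "J. Doe", "PER", None, "0001"),
    Mention("doc3", "Jane Doe", "PER", None, "0001"),
]
groups = group_by_entity(mentions)
```

In this toy example, the two organization mentions share one KB node, while
the two person mentions have no KB entry and are grouped into a single NIL
cluster across documents.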

Source data consists of Chinese, English, and Spanish newswire and web text
collected by LDC. The EDL 2014 task involved English data only. Chinese and
Spanish data were added in the 2015 task. 

TAC KBP Entity Discovery and Linking - Comprehensive Training and Evaluation
Data 2014-2015 is available as a web download. 

2019 Subscription Members will automatically receive copies of this corpus.
2019 Standard Members may request a copy as part of their 16 free membership
corpora. Non-members may license this data for a fee. 

Membership Office
Linguistic Data Consortium
University of Pennsylvania
T: +1-215-573-1275
E: ldc at ldc.upenn.edu
M: 3600 Market St. Suite 810
Philadelphia, PA 19104

Linguistic Field(s): Computational Linguistics
