29.4070, FYI: October 2018 Newsletter - LDC

Fri Oct 19 02:39:11 UTC 2018

LINGUIST List: Vol-29-4070. Thu Oct 18 2018. ISSN: 1069 - 4875.

Subject: 29.4070, FYI: October 2018 Newsletter - LDC

Moderator: linguist at linguistlist.org (Malgorzata E. Cavar)
Reviews: reviews at linguistlist.org (Helen Aristar-Dry, Robert Coté)
Homepage: https://linguistlist.org

Please support the LL editors and operation with a donation at:
           https://funddrive.linguistlist.org/donate/

Editor for this issue: Everett Green <everett at linguistlist.org>
================================================================

Date: Thu, 18 Oct 2018 22:38:00
From: Membership Office [ldc at ldc.upenn.edu]
Subject: October 2018 Newsletter - LDC

In this newsletter: 

Fall 2018 LDC Data Scholarship Recipients
Membership Year 2019 Publication Preview

New publications:

Concretely Annotated English Gigaword

TRAD Arabic-French Parallel Text -- Newswire

TAC KBP English Regular Slot Filling - Comprehensive Training and Evaluation
Data 2009-2014

-----------------------------------------

Fall 2018 LDC Data Scholarship Recipients

Congratulations to the recipients of LDC's Fall 2018 Data Scholarships:

Utkrist Adhikari: University of Bonn (Germany); M.Sc, Computer Science.
Utkrist is awarded a copy of Treebank-2 for his research in named entity
recognition, supersense tagging, and semantic role labeling.

Vitaliya Remneva: Higher School of Economics, National Research University
(Russia); M.Sc, System and Software Engineering. Vitaliya is awarded a copy of
ETS Corpus of Non-Native Written English for her work in author profiling
through natural language processing.

Tian Xiaoyu: Shanghai International Studies University (China); MA,
Linguistics. Tian is awarded a copy of Tagged Chinese Gigaword Version 2.0 for
her research in causative construction variations in Mainland Chinese, Taiwan
Chinese, and Singapore Chinese. 

W. Victor H. Yarlott: Florida International University (US); Ph.D., School of
Computing and Information Sciences. Victor is awarded a copy of ACE2005
Multilingual Training Corpus for his research in relation extraction. 

For information about the program, visit the Data Scholarship page. 

Membership Year 2019 Publication Preview

The 2019 Membership Year is fast approaching and plans for next year’s
publications are in progress. Among the expected releases are:

SRI Speech-Based Collaborative Learning Corpus: speech from over 100 US middle
school students performing collaborative learning tasks; includes audio
recordings, orthographic transcriptions, manual annotation of collaboration,
and related documentation

Chinese Abstract Meaning Representation (AMR): developed by Nanjing Normal
University and Brandeis University, semantic representation of approximately
10,000 Chinese sentences following the basic principles of AMR using web
source data from Chinese Treebank 8.0 (LDC2013T21)

Multilanguage conversational telephone speech: developed to support language
identification research in related languages (Arabic, East Asian, English,
Mandarin)

TAC KBP: English entity discovery and linking, nugget detection and event
argument data, Chinese slot-filling data

IARPA Babel Language Packs (telephone speech and transcripts): languages
include Amharic, Guarani, Igbo, and Lithuanian

HAVIC Med Progress Test data: web video, metadata, and annotations for
developing multimedia systems

BOLT: discussion forums, SMS, word-aligned and tagged data in all languages
(Chinese, Egyptian Arabic, English)

Check your inbox in the coming weeks for more information about membership
renewal.

-----------------------------------------

New publications:

(1) Concretely Annotated English Gigaword was developed by Johns Hopkins
University's Human Language Technology Center of Excellence. It adds multiple
kinds and instances of automatically-generated syntactic, semantic, and
coreference annotations to English Gigaword Fifth Edition (LDC2011T07).
Concrete is a schema for representing structured, hierarchical, and
overlapping linguistic annotations. This release includes the outputs of
multiple tools that produce the same annotation types, stored as different
annotation theories under a shared tokenization.
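
As a rough illustration of what "different annotation theories under a shared
tokenization" can mean, here is a minimal Python sketch. It is not the actual
Concrete schema or its API: the class names (Tagging, Sentence), the tool
names, and the tags are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Tagging:
    """One tool's labeling of a sentence (one 'theory'). Hypothetical type."""
    tool: str          # hypothetical tool name
    kind: str          # annotation type, e.g. "POS"
    labels: List[str]  # exactly one label per token


@dataclass
class Sentence:
    """A single tokenization shared by every theory attached to it."""
    tokens: List[str]
    taggings: List[Tagging] = field(default_factory=list)

    def add_theory(self, tagging: Tagging) -> None:
        # Every theory must align with the shared tokenization.
        if len(tagging.labels) != len(self.tokens):
            raise ValueError("labels must align one-to-one with tokens")
        self.taggings.append(tagging)

    def theories(self, kind: str) -> List[Tagging]:
        # All competing analyses of the same annotation type.
        return [t for t in self.taggings if t.kind == kind]


sentence = Sentence(tokens=["LDC", "releases", "corpora"])
sentence.add_theory(Tagging("tagger_a", "POS", ["NNP", "VBZ", "NNS"]))
sentence.add_theory(Tagging("tagger_b", "POS", ["NN", "VBZ", "NNS"]))

pos_theories = sentence.theories("POS")
for theory in pos_theories:
    print(theory.tool, list(zip(sentence.tokens, theory.labels)))
```

Because both hypothetical taggers label the same token list, their outputs can
be compared position by position without any realignment step.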

Concretely Annotated English Gigaword contains the nearly ten million
documents (over four billion words) of the original English Gigaword Fifth
Edition, which consists of newswire stories from seven sources collected by
LDC between 1994 and 2010.

Concretely Annotated English Gigaword is distributed via hard drive.

2018 Subscription Members will automatically receive copies of this corpus.
2018 Standard Members may request a copy as part of their 16 free membership
corpora. Any organization that licensed English Gigaword Fifth Edition
(LDC2011T07) or Annotated English Gigaword (LDC2012T21) may request a copy of
Concretely Annotated English Gigaword for a $250 media fee. Non-members may
license this data for a fee.

*

(2) TAC KBP English Regular Slot Filling - Comprehensive Training and
Evaluation Data 2009-2014 was developed by LDC and contains training and
evaluation data produced in support of the TAC KBP Slot Filling evaluation
track conducted from 2009 to 2014.

Text Analysis Conference (TAC) is a series of workshops organized by the
National Institute of Standards and Technology (NIST). TAC was developed to
encourage research in natural language processing and related applications by
providing a large test collection, common evaluation procedures, and a forum
for researchers to share their results. 

The regular English Slot Filling evaluation track involved mining information
about entities from text. In completing the task, participating systems and
LDC annotators searched a corpus for information on certain attributes (slots)
of person and organization entities and attempted to return all valid answers
(slot fillers) in the source collection. For more information about English
Slot Filling, please refer to the 2014 track home page.

This release contains queries, the 'manual runs' (human-produced responses to
the queries), and the final rounds of assessment results. 

TAC KBP English Regular Slot Filling - Comprehensive Training and Evaluation
Data 2009-2014 is distributed via web download.

2018 Subscription Members will automatically receive copies of this corpus.
2018 Standard Members may request a copy as part of their 16 free membership
corpora. Non-members may license this data for a fee.

*

(3) TRAD Arabic-French Parallel Text -- Newswire was developed by ELDA as
part of the PEA-TRAD project. It contains French translations of a subset of
approximately 20,000 Arabic words from NIST 2008 Open Machine Translation
(OpenMT) Evaluation (LDC2010T21). The purpose of the PEA-TRAD project
(Translation as a Support for Document Analysis) was to develop
speech-to-speech translation technology for multiple languages (e.g., Arabic,
Chinese, Pashto) from a variety of domains. 

This release consists of 813 segments (translation units) from 74 documents.
The Arabic source file contains 19,902 words and the French reference
translation contains 29,104 words. The source data is Arabic newswire text
collected and translated into English by LDC. Information about the ELDA
translation team, translation guidelines, and validation results is contained
in the documentation accompanying this release.
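
For a sense of scale, the figures above can be turned into a few derived
averages. The short Python sketch below uses only the counts stated in this
announcement; the variable names are illustrative, and the averages are simple
ratios, not statistics published with the corpus.

```python
# Counts as stated in the release description.
segments = 813
documents = 74
arabic_words = 19_902
french_words = 29_104

# Simple derived ratios (averages over the whole release).
segments_per_document = segments / documents
arabic_words_per_segment = arabic_words / segments
french_words_per_segment = french_words / segments
french_to_arabic_ratio = french_words / arabic_words

print(f"segments per document:    {segments_per_document:.2f}")
print(f"Arabic words per segment: {arabic_words_per_segment:.2f}")
print(f"French words per segment: {french_words_per_segment:.2f}")
print(f"French/Arabic word ratio: {french_to_arabic_ratio:.2f}")
```

On these counts, the release averages roughly 11 segments per document, and
the French reference runs to about 1.46 times as many words as the Arabic
source.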

TRAD Arabic-French Parallel Text -- Newswire is distributed via web download.

2018 Subscription Members will receive copies of this corpus provided they
have submitted a completed copy of the special license agreement. 2018
Standard Members may request a copy as part of their 16 free membership
corpora. Non-members may license this data for a fee.

*

Linguistic Field(s): Computational Linguistics

------------------------------------------------------------------------------
