29.821, FYI: February 2018 Newsletter - LDC

The LINGUIST List linguist at listserv.linguistlist.org
Tue Feb 20 21:15:37 UTC 2018


LINGUIST List: Vol-29-821. Tue Feb 20 2018. ISSN: 1069 - 4875.

Subject: 29.821, FYI: February 2018 Newsletter - LDC

Moderators: linguist at linguistlist.org (Damir Cavar, Malgorzata E. Cavar)
Reviews: reviews at linguistlist.org (Helen Aristar-Dry, Robert Coté,
                                   Michael Czerniakowski)
Homepage: http://linguistlist.org

Please support the LL editors and operation with a donation at:
           http://funddrive.linguistlist.org/donate/

Editor for this issue: Kenneth Steimel <ken at linguistlist.org>
================================================================


Date: Tue, 20 Feb 2018 16:15:14
From: Membership Coordinator [ldc at ldc.upenn.edu]
Subject: February 2018 Newsletter - LDC

 
In this newsletter: 

Only two weeks left to enjoy 2018 membership discounts
Spring 2018 LDC Data Scholarship recipients
LDC data and commercial technology development

New Publications:
Multi-Language Conversational Telephone Speech 2011 -- Central Asian
LORELEI Amharic Representative Language Pack - Monolingual and Parallel Text
TAC KBP Comprehensive English Source Corpora 2009-2014
IARPA Babel Tok Pisin Language Pack IARPA-babel207b-v1.0e

Only two weeks left to enjoy 2018 membership discounts

There is still time to save on 2018 membership fees. Through March 1, all
organizations that join or renew receive a discount of up to 10% on the 2018
membership fee.

For more information on membership benefits, visit the Join LDC page.
 
Spring 2018 LDC Data Scholarship recipients

Congratulations to the recipients of LDC's Spring 2018 Data Scholarship:

Margarida Madaleno: London School of Economics, PhD Economic Geography.
Madaleno is awarded a copy of Treebank 3 for her research in emotional
well-being.

Gary Munnelly: Trinity College Dublin, PhD Computer Science and Statistics.
Munnelly is awarded a copy of the New York Times Annotated Corpus for his
research in named entity recognition and disambiguation in cultural heritage
data sets. 

Barlian Henryanu Prasetio: University of Miyazaki, PhD Environmental Robotics.
Prasetio is awarded copies of SUSAS and SUSAS Transcripts for his work in
voice stress recognition systems. 

For information about the program, visit the Data Scholarship page.

LDC data and commercial technology development

For-profit organizations are reminded that an LDC membership is a
prerequisite for obtaining a commercial license to almost all LDC databases.
Non-member organizations, including non-member for-profit organizations,
cannot use LDC data to develop or test products for commercialization, nor can
they use LDC data in any commercial product or for any commercial purpose. LDC
data users should consult corpus-specific license agreements for limitations
on the use of certain corpora. Visit the Licensing page for further
information.

New publications:

(1) Multi-Language Conversational Telephone Speech 2011 – Central Asian was
developed by LDC and consists of approximately 37 hours of telephone speech
in three distinct language varieties of Central Asia: Dari, Farsi and
Pashto.

The data were collected primarily to support research and technology
evaluation in automatic language identification, and portions of these
telephone calls were used in the NIST 2011 Language Recognition Evaluation
(LRE). Participants were recruited by native speakers who contacted
acquaintances in their social networks. Each native speaker made one call of
up to 15 minutes to each acquaintance.

LDC has also released the following as part of the Multi-Language
Conversational Telephone Speech 2011 series:

- Slavic Group (LDC2016S11)
- Turkish (LDC2017S09)
- South Asian (LDC2017S14)

Multi-Language Conversational Telephone Speech 2011 – Central Asian is
distributed via web download.

2018 Subscription Members will receive copies of this corpus. 2018 Standard
Members may request a copy as part of their 16 free membership corpora.
Non-members may license this data for a fee.

(2) LORELEI Amharic Representative Language Pack - Monolingual and Parallel
Text was developed by LDC and consists of approximately 25 million words of
monolingual Amharic text, approximately 600,000 of which are translated into
English. Another 80,000 words are translated from English into Amharic. The
LORELEI (Low Resource Languages for Emergent Incidents) Program
is concerned with building human language technology for low resource
languages in the context of emergent situations like natural disasters or
disease outbreaks. 

Data was collected in the following genres: discussion forums, news,
reference, social network and weblog. Both monolingual text collection and
parallel text creation involved a combination of manual and automatic methods,
which are detailed in the included documentation. All harvested content was
initially converted from its original HTML form into a relatively uniform XML
format. Also included in this release are two tools: one to recreate original
source data from the processed XML material and the other to condition text
data that users download from Twitter.

LORELEI Amharic Representative Language Pack - Monolingual and Parallel Text
is distributed via web download.

2018 Subscription Members will receive copies of this corpus. 2018 Standard
Members may request a copy as part of their 16 free membership corpora.
Non-members may license this data for a fee.

(3) TAC KBP Comprehensive English Source Corpora 2009-2014 was developed by
LDC and contains the 3,877,207 English source documents used in support of the
TAC KBP tasks from 2009 to 2014. The Text Analysis Conference (TAC) is a
series of workshops organized by the National Institute of Standards and
Technology (NIST). TAC was developed to encourage research in natural language
processing and related applications by providing a large test collection,
common evaluation procedures, and a forum for researchers to share their
results. Through its various evaluations, the Knowledge Base Population (KBP)
track of TAC encourages the development of systems that can match entities
mentioned in natural texts with those appearing in a knowledge base, extract
novel information about entities from a document collection, and add that
information to a new or existing knowledge base.

The source data consists of newswire, broadcast material, and web text
collected by LDC. Documents are released as a collection of zip files for
overall compactness, and ease and efficiency of use. When unpacked, the
documents are all UTF-8 text files with a basic markup structure.
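
For readers planning to work with the zipped release, the following is a
minimal Python sketch of reading UTF-8 documents directly from a zip archive
without unpacking it to disk. It is an illustration only, not LDC tooling: the
member names and the <doc> markup in the demo archive are hypothetical, and
the actual file layout and markup are described in the corpus documentation.

```python
import io
import zipfile
from typing import Iterator, Tuple


def iter_documents(archive) -> Iterator[Tuple[str, str]]:
    """Yield (member_name, text) for every file in a zip archive.

    `archive` may be a filesystem path or a binary file-like object.
    """
    with zipfile.ZipFile(archive) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue  # skip directory entries
            with zf.open(info) as fh:
                # Assumes UTF-8 text, as stated for the unpacked documents.
                yield info.filename, fh.read().decode("utf-8")


def build_demo_archive() -> io.BytesIO:
    """Build a tiny in-memory zip so the sketch runs without the corpus.

    The member names and <doc> markup here are invented for illustration.
    """
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("demo/doc_0001.txt", "<doc>Example newswire text.</doc>")
        zf.writestr("demo/doc_0002.txt", "<doc>Café, naïve: UTF-8 check.</doc>")
    buf.seek(0)
    return buf


if __name__ == "__main__":
    for name, text in iter_documents(build_demo_archive()):
        print(name, len(text))
```

Passing the path of a downloaded zip file to iter_documents in place of the
demo archive should work the same way, provided the members are plain UTF-8
text files.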

TAC KBP Comprehensive English Source Corpora 2009-2014 is distributed via web
download.

2018 Subscription Members will receive copies of this corpus. 2018 Standard
Members may request a copy as part of their 16 free membership corpora.
Non-members may license this data for a fee.

(4) IARPA Babel Tok Pisin Language Pack IARPA-babel207b-v1.0e was developed by
Appen for the IARPA (Intelligence Advanced Research Projects Activity) Babel
program. It contains approximately 200 hours of Tok Pisin conversational and
scripted telephone speech collected in 2013 along with corresponding
transcripts.

The Tok Pisin speech in this release represents that spoken in the Papuan
dialect region of Papua New Guinea. The gender distribution among speakers is
approximately equal; speakers' ages range from 16 years to 65 years. Calls
were made using different telephones (e.g., mobile, landline) from a variety
of environments including the street, a home or office, a public place, and
inside a vehicle.

IARPA Babel Tok Pisin Language Pack IARPA-babel207b-v1.0e is available via web
download.

2018 Subscription Members will receive copies of this corpus provided they
have submitted a completed copy of the special license agreement. 2018
Standard Members may request a copy as part of their 16 free membership
corpora. Non-members may license this data for a fee.

Membership Office
Linguistic Data Consortium
University of Pennsylvania
T: +1-215-573-1275
E: ldc at ldc.upenn.edu
M: 3600 Market St. Suite 810
      Philadelphia, PA 19104
 



Linguistic Field(s): Computational Linguistics





 


