29.2587, FYI: June 2018 Newsletter - LDC

Tue Jun 19 13:41:29 UTC 2018

LINGUIST List: Vol-29-2587. Tue Jun 19 2018. ISSN: 1069 - 4875.

Subject: 29.2587, FYI: June 2018 Newsletter - LDC

Moderators: linguist at linguistlist.org (Damir Cavar, Malgorzata E. Cavar)
Reviews: reviews at linguistlist.org (Helen Aristar-Dry, Robert Coté,
                                   Michael Czerniakowski)
Homepage: http://linguistlist.org

Please support the LL editors and operation with a donation at:
           http://funddrive.linguistlist.org/donate/

Editor for this issue: Kenneth Steimel <ken at linguistlist.org>
================================================================

Date: Tue, 19 Jun 2018 09:41:15
From: Membership Office [ldc at ldc.upenn.edu]
Subject: June 2018 Newsletter - LDC

In this newsletter: 

LDC Catalog certified as CoreTrustSeal data repository 
LDC data and commercial technology development

New Publications:

BOLT Chinese SMS/Chat
Multi-Language Conversational Telephone Speech 2011 -- Central European
TAC KBP English Entity Linking - Comprehensive Training and Evaluation Data
2009-2013
IARPA Babel Cebuano Language Pack IARPA-babel301b-v2.0b

LDC Catalog certified as CoreTrustSeal data repository 

LDC is pleased to announce that the Catalog has been awarded the CoreTrustSeal
for recognition as a trustworthy data repository. This means that the Catalog
meets a series of standards covering data access, rights management, curation,
and storage developed by the ICSU World Data System and the Data Seal of
Approval. LDC joins the other 136 certified repositories around the globe in
the commitment to promote sustainable and trustworthy data infrastructures. 

LDC data and commercial technology development

For-profit organizations are reminded that an LDC membership is a
prerequisite for obtaining a commercial license to almost all LDC databases.
Non-member organizations, including non-member for-profit organizations,
cannot use LDC data to develop or test products for commercialization, nor can
they use LDC data in any commercial product or for any commercial purpose. LDC
data users should consult corpus-specific license agreements for limitations
on the use of certain corpora. Visit the Licensing page for further
information.

New Publications:

(1) BOLT Chinese SMS/Chat was developed by LDC and consists of
naturally-occurring Short Message Service (SMS) and Chat (CHT) data collected
through data donations and live collection involving native speakers of
Chinese. The corpus contains 14,877 conversations totaling 3,005,810 words
across 497,543 messages.

The BOLT (Broad Operational Language Translation) program developed machine
translation and information retrieval for less formal genres, focusing
particularly on user-generated content. LDC supported the BOLT program by
collecting informal data sources – discussion forums, text messaging, and chat
– in Chinese, Egyptian Arabic, and English. The collected data was translated
and annotated for various tasks including word alignment, treebanking,
propbanking, and co-reference. The data in this release was collected using
two methods: new collection via LDC's collection platform, and donation of SMS
or chat archives from BOLT collection participants. 

BOLT Chinese SMS/Chat is distributed via web download.

2018 Subscription Members will automatically receive copies of this corpus.
2018 Standard Members may request a copy as part of their 16 free membership
corpora. Non-members may license this data for a fee.

(2) Multi-Language Conversational Telephone Speech 2011 -- Central European
was developed by LDC and comprises approximately 44 hours of telephone
speech in two distinct language varieties of Central Europe: Czech and Slovak.

The data were collected primarily to support research and technology
evaluation in automatic language identification, and portions of these
telephone calls were used in the NIST 2011 Language Recognition Evaluation
(LRE). Participants were recruited by native speakers who contacted
acquaintances in their social network. Those native speakers made one call, up
to 15 minutes, to each acquaintance. Human auditors labeled the calls for
callee gender, dialect type, and noise.

LDC has also released the following as part of the Multi-Language
Conversational Telephone Speech 2011 series:

- Slavic Group (LDC2016S11)
- Turkish (LDC2017S09)
- South Asian (LDC2017S14)
- Central Asian (LDC2018S03)

Multi-Language Conversational Telephone Speech 2011 -- Central European is
distributed via web download.

2018 Subscription Members will automatically receive copies of this corpus.
2018 Standard Members may request a copy as part of their 16 free membership
corpora. Non-members may license this data for a fee.

(3) TAC KBP English Entity Linking - Comprehensive Training and Evaluation
Data 2009-2013 was developed by LDC and contains training and evaluation data
produced in support of the TAC KBP English Entity Linking tasks in 2009, 2010,
2011, 2012, and 2013. It includes queries and gold standard entity type
information, Knowledge Base links, and equivalence class clusters for NIL
entities. Also included are the source documents for the queries,
specifically, English newswire, discussion forum, and web data. The
corresponding knowledge base is available as TAC KBP Reference Knowledge Base
(LDC2014T16). Also included in this package are the results of an Entity
Linking IAA (Inter-Annotator Agreement) study conducted in 2010.

TAC KBP encourages the development of systems that can match entities
mentioned in natural texts with those appearing in a knowledge base and
extract novel information about entities from a document collection and add it
to a new or existing knowledge base. English Entity Linking was first
conducted as part of the 2009 TAC KBP evaluations. Its goal is to measure
systems' ability to determine whether an entity, specified by a query, has a
matching node in a reference knowledge base (KB) and, if so, to create a link
between the two. If there is no matching node for a query entity in the KB,
Entity Linking (EL) systems are required to cluster the mention together with
others referencing the same entity.

TAC KBP English Entity Linking - Comprehensive Training and Evaluation Data
2009-2013 is distributed via web download.
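
The link-or-cluster decision described above for English Entity Linking can
be sketched roughly as follows. This is only a toy illustration, not the
method of any TAC KBP system and not the corpus's file format: the Query
class, the normalize and link_queries helpers, the dictionary standing in for
the knowledge base, and the ID strings are all hypothetical, and exact
matching on a normalized name is an assumed simplification.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Query:
    query_id: str  # hypothetical ID, not the corpus's actual format
    mention: str   # entity name string taken from a source document


def normalize(name: str) -> str:
    """Lowercase and collapse whitespace: a deliberately naive match key."""
    return " ".join(name.lower().split())


def link_queries(queries: List[Query], kb: Dict[str, str]) -> Dict[str, str]:
    """Map each query ID to a KB node ID or, failing that, a NIL cluster ID.

    kb maps a normalized entity name to a node ID (toy stand-in for a KB).
    """
    nil_clusters: Dict[str, str] = {}  # normalized name -> NIL cluster ID
    results: Dict[str, str] = {}
    for q in queries:
        key = normalize(q.mention)
        node = kb.get(key)
        if node is not None:
            # A matching node exists in the KB: link the query to it.
            results[q.query_id] = node
        else:
            # No matching node: group with other NIL queries sharing the key.
            if key not in nil_clusters:
                nil_clusters[key] = f"NIL{len(nil_clusters) + 1:04d}"
            results[q.query_id] = nil_clusters[key]
    return results


if __name__ == "__main__":
    toy_kb = {"prague": "E0001"}  # assumed toy knowledge base
    toy_queries = [
        Query("Q1", "Prague"),
        Query("Q2", "John  Smith"),
        Query("Q3", "john smith"),
        Query("Q4", "Bratislava"),
    ]
    print(link_queries(toy_queries, toy_kb))
```

In this sketch the two "John Smith" queries end up in the same NIL cluster
because they share a normalized name. A real system would need context from
the source documents to decide whether two mentions refer to the same entity.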

2018 Subscription Members will automatically receive copies of this corpus.
2018 Standard Members may request a copy as part of their 16 free membership
corpora. Non-members may license this data for a fee.

(4) IARPA Babel Cebuano Language Pack IARPA-babel301b-v2.0b was developed by
Appen for the IARPA (Intelligence Advanced Research Projects Activity) Babel
program. It contains approximately 191 hours of Cebuano conversational and
scripted telephone speech collected in 2013 and 2014 along with corresponding
transcripts.

The Cebuano speech in this release represents that spoken in the Cebu-North
Kana, Sialo, and Mindanao dialect regions of the Philippines. The gender
distribution among speakers is approximately equal; speakers' ages range from
16 years to 75 years. Calls were made using different telephones (e.g.,
mobile, landline) from a variety of environments including the street, a home
or office, a public place, and inside a vehicle.

IARPA Babel Cebuano Language Pack IARPA-babel301b-v2.0b is available via web
download.

2018 Subscription Members will receive copies of this corpus provided they
have submitted a completed copy of the special license agreement. 2018
Standard Members may request a copy as part of their 16 free membership
corpora. Non-members may license this data for a fee.

Linguistic Field(s): Computational Linguistics

------------------------------------------------------------------------------

*****************    LINGUIST List Support    *****************
Please support the LL editors and operation with a donation at:

              The IU Foundation Crowd Funding site:
       https://iufoundation.fundly.com/the-linguist-list

               The LINGUIST List FundDrive Page:
            http://funddrive.linguistlist.org/donate/
