31.2805, FYI: September 2020 Newsletter - Linguistic Data Consortium

Wed Sep 16 03:28:03 UTC 2020

LINGUIST List: Vol-31-2805. Tue Sep 15 2020. ISSN: 1069 - 4875.

Moderator: Malgorzata E. Cavar (linguist at linguistlist.org)
Student Moderator: Jeremy Coburn
Managing Editor: Becca Morris
Team: Helen Aristar-Dry, Everett Green, Sarah Robinson, Lauren Perkins, Nils Hjortnaes, Yiwen Zhang, Joshua Sims
Jobs: jobs at linguistlist.org | Conferences: callconf at linguistlist.org | Pubs: pubs at linguistlist.org

Homepage: http://linguistlist.org

Editor for this issue: Everett Green <everett at linguistlist.org>
================================================================

Date: Tue, 15 Sep 2020 23:27:44
From: Membership Coordinator [ldc at ldc.upenn.edu]
Subject: September 2020 Newsletter - Linguistic Data Consortium

In this newsletter: 

New publications:
BOLT English PropBank and Sense – Discussion Forum, SMS/Chat and
Conversational Telephone Speech
LORELEI Tigrinya Incident Language Pack
Chinese Lexical Resources for Gender, Number, Animacy

New publications:
(1) BOLT English PropBank and Sense – Discussion Forum, SMS/Chat and
Conversational Telephone Speech was developed by the University of Colorado,
Boulder – CLEAR (Computational Language and Education Research) and consists
of PropBank and verb sense disambiguation annotation on English discussion
forum (DF), SMS/Chat, and conversational telephone speech data. Annotation was
applied to each predicate verb tree in LDC’s BOLT phrase structure treebanks.
PropBank annotation provides a layer of semantic annotation over the treebank
and was performed on all three genres. DF and SMS/Chat data were also
annotated for verb sense disambiguation using VerbNet 3.2 classes.

BOLT English PropBank and Sense – Discussion Forum, SMS/Chat and
Conversational Telephone Speech is distributed via web download. 

2020 Subscription Members will automatically receive copies of this corpus.
2020 Standard Members may request a copy as part of their 16 free membership
corpora. Non-members may license this data for a fee.

(2) LORELEI Tigrinya Incident Language Pack was developed by LDC and
comprises approximately 4.5 million words of Tigrinya monolingual text,
25,000 words of English monolingual text, 235,000 words of parallel and
comparable Tigrinya-English text, and 50,000 words of data annotated for
Entity Discovery and Linking and for Situation Frames. It contains all of the
text data, annotations, supplemental resources, and related software tools for
the Tigrinya language that were used in the DARPA LORELEI / LoReHLT 2017
Evaluation. 

Data was collected from news, social network, weblog, newsgroup, discussion
forum, and reference material. Entity Discovery and Linking and Situation
Frame annotations identified “entities,” “needs” (such as a need for food),
and “issues” (such as civil unrest) to be detected by systems for scoring
purposes. Situation Frame analysis was designed to extract basic information
that would be useful for planning a disaster response effort.

LORELEI Tigrinya Incident Language Pack is distributed via web download. 

2020 Subscription Members will automatically receive copies of this corpus.
2020 Standard Members may request a copy as part of their 16 free membership
corpora. Non-members may license this data for a fee.

(3) Chinese Lexical Resources for Gender, Number, Animacy was developed by LDC
and consists of gender, number, and animacy lexicons produced in support of
the DARPA DEFT program. Gender, number, and animacy are lexical indicators
useful for named entity tagging, including the detection of person mentions in
text.

This corpus was created by extracting information from newswire texts in
Chinese Gigaword Fifth Edition (LDC2011T13) in the following steps: (1)
segmenting source documents into sentences; (2) converting any traditional
Chinese script to simplified Chinese; (3) tagging all sentences for
part of speech; (4) developing queries to detect patterns; and (5) building
lexicons based on frequency counts and entity types.
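
As a rough illustration only, the five steps above might be sketched in Python
as follows. This is not LDC's actual pipeline: the character table, the cue
words, the example sentences, and all function names are hypothetical, and
step 3 is replaced by hand-written (word, tag) pairs that assume Penn Chinese
Treebank style tags (NR for proper nouns).

```python
import re
from collections import Counter, defaultdict

# Toy traditional-to-simplified table (assumption: a real pipeline
# would use a full conversion resource, not five characters).
TRAD_TO_SIMP = str.maketrans(
    {"陳": "陈", "張": "张", "華": "华", "說": "说", "話": "话"}
)

# Hypothetical cue words that follow a name and suggest a gender.
GENDER_CUES = {"先生": "male", "女士": "female"}


def split_sentences(text):
    """Step 1: cut the text after sentence-final punctuation."""
    return re.findall(r"[^。！？]+[。！？]?", text)


def to_simplified(sentence):
    """Step 2: map traditional characters to simplified ones."""
    return sentence.translate(TRAD_TO_SIMP)


def build_gender_lexicon(tagged_sentences):
    """Steps 4-5: find proper noun + cue word, then keep frequency counts."""
    counts = defaultdict(Counter)
    for tokens in tagged_sentences:
        # Look at each token together with the token that follows it.
        for (word, tag), (next_word, _) in zip(tokens, tokens[1:]):
            if tag == "NR" and next_word in GENDER_CUES:
                counts[word][GENDER_CUES[next_word]] += 1
    # Each entry stores the majority gender and the total number of hits.
    return {
        name: {"gender": c.most_common(1)[0][0], "freq": sum(c.values())}
        for name, c in counts.items()
    }


raw = "陳明先生說話。張華女士說話。陳明先生說話。"
sentences = [to_simplified(s) for s in split_sentences(raw)]

# Step 3 stand-in: hand-written (word, tag) pairs instead of a real tagger.
tagged = [
    [("陈明", "NR"), ("先生", "NN"), ("说话", "VV"), ("。", "PU")],
    [("张华", "NR"), ("女士", "NN"), ("说话", "VV"), ("。", "PU")],
    [("陈明", "NR"), ("先生", "NN"), ("说话", "VV"), ("。", "PU")],
]
lexicon = build_gender_lexicon(tagged)
```

Here each lexicon entry keeps a frequency count next to the predicted
feature, in the spirit of the dictionaries described in this announcement;
the actual file formats of the corpus are not shown here.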

The resulting resources include dictionaries of Chinese animate nominals and
names; Chinese nominals and names with predicted gender and number; and other
dictionaries of Chinese nominals, names, verbs, and pronouns. Each dictionary
contains frequency information as well as the features in question.

Chinese Lexical Resources for Gender, Number, Animacy is distributed via web
download. 

2020 Subscription Members will automatically receive copies of this corpus.
2020 Standard Members may request a copy as part of their 16 free membership
corpora. Non-members may license this data for a fee. 

Membership Coordinator
Linguistic Data Consortium
E: ldc at ldc.upenn.edu

Linguistic Field(s): Computational Linguistics

------------------------------------------------------------------------------
