33.2576, FYI: August 2022 Newsletter - LDC

Tue Aug 23 07:50:13 UTC 2022

LINGUIST List: Vol-33-2576. Tue Aug 23 2022. ISSN: 1069 - 4875.

Subject: 33.2576, FYI: August 2022 Newsletter - LDC

Moderator: Malgorzata E. Cavar (linguist at linguistlist.org)
Student Moderator: Billy Dickson
Managing Editor: Lauren Perkins
Team: Helen Aristar-Dry, Everett Green, Sarah Goldfinch, Nils Hjortnaes,
        Joshua Sims, Billy Dickson, Amalia Robinson, Matthew Fort
Jobs: jobs at linguistlist.org | Conferences: callconf at linguistlist.org | Pubs: pubs at linguistlist.org

Homepage: http://linguistlist.org

Hosted by Indiana University

Please support the LL editors and operation with a donation at:
           https://funddrive.linguistlist.org/donate/

Editor for this issue: Everett Green <everett at linguistlist.org>
================================================================

Date: Tue, 23 Aug 2022 07:49:32
From: Membership Coordinator [ldc at ldc.upenn.edu]
Subject: August 2022 Newsletter - LDC

In this newsletter: 
Fall 2022 LDC Data Scholarship Program
30th Anniversary Highlight: The LDC Gigawords

________________________________________

New publication:
HAVIC MED Novel 2 Test – Videos, Metadata and Annotation

Fall 2022 LDC Data Scholarship Program 
Student applications for the Fall 2022 LDC Data Scholarship program are being
accepted now through September 15, 2022. This program provides eligible
students with no-cost access to LDC data. Students must complete an
application consisting of a data use proposal and letter of support from their
advisor. For application requirements and program rules, visit the LDC Data
Scholarships page. 

30th Anniversary Highlight: The LDC Gigawords 
Giga: a combining form meaning “billion,” used in the formation of compound
words (Source: https://www.dictionary.com/browse/giga-)

LDC’s Gigaword corpora are a natural outgrowth of its vast, decades-long,
multi-language newswire collection. Newswire data was originally collected,
annotated, and distributed for use in many sponsored projects and was also
released through the LDC catalog in tailored data sets. Then came the idea of
making LDC’s entire newswire collection available by language with a simple,
minimal markup to support a broad range of NLP/HLT tasks. The first Arabic,
Chinese, and English Gigaword editions were released in 2003; subsequent
cumulative releases through the fifth editions in 2011 represent LDC’s newswire
collection spanning 1994-2010 in those languages. French and Spanish Gigawords
were first published in 2006, culminating in the release of third editions in
2011, likewise covering newswire collected by LDC through 2010.

The community has used, and continues to use, these data sets in numerous
ways. Automatic text summarization is a popular application, and current work
in this area applies deep learning methods (see, e.g., Gao et al. 2020, English).
Gigawords are also useful for text source classification (Huang et al. 2003,
Chinese), information extraction (Lan et al. 2020, Arabic), knowledge
extraction and distributional semantics (Napoles et al. 2012, English), and
natural language understanding (Ganitkevitch 2013, English), among other
fields. Recent variations, such as the annotated and concretely annotated
English Gigawords, add syntactic, semantic, and coreference annotations to
this billion-word text collection.

All Gigaword corpora are available for licensing by Consortium members and
non-members. Visit LDC’s Obtaining Data page for more information.

________________________________________

New publication:

HAVIC MED Novel 2 Test – Videos, Metadata and Annotation comprises 6,200
hours of user-generated videos with annotation and metadata developed by LDC
for the 2015 NIST Multimedia Event Detection tasks. The data consists of
videos of various events (event videos) and videos completely unrelated to
events (background videos). Each event video was manually annotated with
judgments describing its event properties and other salient features.
Background videos were labeled with topic and genre categories.

HAVIC MED Novel 2 Test – Videos, Metadata and Annotation is distributed via
web download. 

2022 Subscription Members will automatically receive copies of this corpus.
2022 Standard Members may request a copy as part of their 16 free membership
corpora. This corpus is a members-only release and is not available for
non-member licensing. Contact ldc at ldc.upenn.edu for information about
membership.

Membership Coordinator
Linguistic Data Consortium
University of Pennsylvania
T: +1-215-573-1275
E: ldc at ldc.upenn.edu
M: 3600 Market St. Suite 810
Philadelphia, PA 19104 

Linguistic Field(s): Computational Linguistics

------------------------------------------------------------------------------

