30.1748, FYI: April 2019 Newsletter - LDC

The LINGUIST List linguist at listserv.linguistlist.org
Wed Apr 24 03:38:16 UTC 2019


LINGUIST List: Vol-30-1748. Tue Apr 23 2019. ISSN: 1069 - 4875.

Moderator: Malgorzata E. Cavar (linguist at linguistlist.org)
Student Moderator: Jeremy Coburn
Managing Editor: Becca Morris
Team: Helen Aristar-Dry, Everett Green, Sarah Robinson, Peace Han, Nils Hjortnaes, Yiwen Zhang, Julian Dietrich
Jobs: jobs at linguistlist.org | Conferences: callconf at linguistlist.org | Pubs: pubs at linguistlist.org

Homepage: http://linguistlist.org

Editor for this issue: Everett Green <everett at linguistlist.org>
================================================================


Date: Tue, 23 Apr 2019 23:37:49
From: Membership Office [ldc at ldc.upenn.edu]
Subject: April 2019 Newsletter - LDC

 
In this newsletter: 

LDC at ICASSP 2019
LDC data and commercial technology development

New Publications:

BOLT Egyptian-English Word Alignment -- Discussion Forum Training
Chinese Abstract Meaning Representation 1.0
HAVIC MED Progress Test -- Videos, Metadata and Annotation

LDC at ICASSP 2019

LDC will be exhibiting at ICASSP 2019, held this year May 12-17 in Brighton,
UK. Stop by booth 5 to learn more about recent developments at the Consortium
and new publications.
LDC will post conference updates via our Twitter feed and Facebook page. We
hope to see you there! 

LDC data and commercial technology development

For-profit organizations are reminded that an LDC membership is a
prerequisite for obtaining a commercial license to almost all LDC databases.
Non-member organizations, including non-member for-profit organizations,
cannot use LDC data to develop or test products for commercialization, nor can
they use LDC data in any commercial product or for any commercial purpose. LDC
data users should consult corpus-specific license agreements for limitations
on the use of certain corpora. Visit the Licensing page for further
information.

New Publications:

(1) BOLT Egyptian-English Word Alignment -- Discussion Forum Training was
developed by LDC and consists of 400,448 words of Egyptian Arabic and English
parallel text enhanced with linguistic tags to indicate word relations.

The source data in this release consists of discussion forum threads harvested
from the Internet by LDC using a combination of manual and automatic processes
and is released as BOLT Arabic Discussion Forums (LDC2018T10).

The BOLT word alignment task was built on treebank annotation. Egyptian source
tree tokens for word alignment were automatically extracted from tree files of
BOLT Egyptian Arabic Treebank annotation on the discussion forum data. Human
annotators then followed LDC guidelines to link words and phrases in Arabic to
those in English.

BOLT Egyptian-English Word Alignment -- Discussion Forum Training is
distributed via web download. 

2019 Subscription Members will automatically receive copies of this corpus.
2019 Standard Members may request a copy as part of their 16 free membership
corpora. Non-members may license this data for a fee. 


(2) Chinese Abstract Meaning Representation 1.0 was developed by Brandeis
University and Nanjing Normal University and consists of semantic
representations of a set of Chinese sentences from the weblog and discussion
forum portions of Chinese Treebank 8.0 (LDC2013T21). Annotations were applied
to 10,149 sentences, with 176 sentences unannotated.

Abstract Meaning Representation (AMR) captures "who is doing what to whom"
in a sentence. Each sentence is paired with a graph that represents its
whole-sentence meaning in a tree structure. Chinese AMR is based on the
annotation methodology developed for English with adaptations for handling
specific Chinese phenomena. The goal of the Chinese AMR project is to create a
large aligned AMR corpus, of which this data set is the first release. For
more information about the project, see the Chinese AMR homepage.
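
As a rough illustration of the "who is doing what to whom" idea (not taken
from the corpus itself), the sketch below stores a toy AMR for the English
sentence "The boy wants to go" as relation triples. The variable names, the
describe() helper and the reading of :ARG0 as the doer are simplifying
assumptions made for this example only; the actual file format of the release
may differ.

```python
# Toy AMR for "The boy wants to go" (illustrative, not corpus data).
# Each variable names a node; INSTANCES maps a node to its concept.
INSTANCES = {"w": "want-01", "b": "boy", "g": "go-01"}

# Labeled edges as (source, relation, target) triples.
RELATIONS = [
    ("w", ":ARG0", "b"),  # the boy is the one who wants
    ("w", ":ARG1", "g"),  # what is wanted: the going event
    ("g", ":ARG0", "b"),  # the boy is also the one who goes
]


def describe(event_var):
    """Return (predicate, ARG0 concept, ARG1 concept) for one event node.

    Hypothetical helper: missing arguments come back as None.
    """
    args = {rel: tgt for src, rel, tgt in RELATIONS if src == event_var}

    def concept(var):
        return INSTANCES.get(var) if var is not None else None

    return (
        INSTANCES[event_var],
        concept(args.get(":ARG0")),
        concept(args.get(":ARG1")),
    )


print(describe("w"))  # ('want-01', 'boy', 'go-01')
print(describe("g"))  # ('go-01', 'boy', None)
```

Note that the node b is reused under both predicates, which is how a single
participant can take part in more than one event of the same sentence.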

Chinese Abstract Meaning Representation 1.0 is distributed via web download. 

2019 Subscription Members will automatically receive copies of this corpus.
2019 Standard Members may request a copy as part of their 16 free membership
corpora. Non-members may license this data for a fee.  


(3) HAVIC MED Progress Test -- Videos, Metadata and Annotation was developed
by LDC and consists of approximately 3,650 hours of user-generated videos
with annotation and metadata.

In a collaboration with NIST (the National Institute of Standards and
Technology) to advance multimodal event detection and related technologies,
LDC developed a large, heterogeneous, annotated multimodal corpus for HAVIC
(the Heterogeneous Audio Visual Internet Collection) that was used in the
NIST-sponsored MED (Multimedia Event Detection) task for several years. HAVIC
MED Progress Test is a subset of that corpus: a collection of event and
background videos originally released to support the 2012-2015 MED tasks.

This release consists of videos of various events (event videos) and videos
completely unrelated to events (background videos) harvested by a large team
of human annotators. Each event video was manually annotated with a set of
judgments describing its event properties and other salient features.
Background videos were labeled with topic and genre categories. 

HAVIC MED Progress Test -- Videos, Metadata and Annotation is distributed via
hard drive. 

2019 Subscription Members will automatically receive copies of this corpus.
2019 Standard Members may request a copy as part of their 16 free membership
corpora. This corpus is a members-only release and is not available for
non-member licensing. Contact ldc at ldc.upenn.edu for information about
membership.


Membership Office
Linguistic Data Consortium
University of Pennsylvania
T: +1-215-573-1275
E: ldc at ldc.upenn.edu
M: 3600 Market St. Suite 810
Philadelphia, PA 19104
 



Linguistic Field(s): Computational Linguistics





 


