33.1770, FYI: May 2022 Newsletter - LDC

Tue May 17 23:33:30 UTC 2022

LINGUIST List: Vol-33-1770. Tue May 17 2022. ISSN: 1069 - 4875.

Subject: 33.1770, FYI: May 2022 Newsletter - LDC

Moderator: Malgorzata E. Cavar (linguist at linguistlist.org)
Student Moderator: Billy Dickson
Managing Editor: Lauren Perkins
Team: Helen Aristar-Dry, Everett Green, Sarah Goldfinch, Nils Hjortnaes,
      Joshua Sims, Billy Dickson, Amalia Robinson, Matthew Fort
Jobs: jobs at linguistlist.org | Conferences: callconf at linguistlist.org | Pubs: pubs at linguistlist.org

Homepage: http://linguistlist.org

Please support the LL editors and operation with a donation at:
           https://funddrive.linguistlist.org/donate/

Editor for this issue: Everett Green <everett at linguistlist.org>
================================================================

Date: Tue, 17 May 2022 19:33:19
From: Membership Coordinator [ldc at ldc.upenn.edu]
Subject: May 2022 Newsletter - LDC

In this newsletter: 
30th Anniversary Highlight: Penn Treebank 

New publications:
NUBUC
Samrómur Icelandic Speech 1.0 

--
30th Anniversary Highlight: Penn Treebank 
LDC’s Catalog features classic corpora responsible for critical advances in
human language technology that continue to influence researchers. Among them
are the Penn Treebank releases, Treebank-2 (LDC96T7) and Treebank-3
(LDC99T42). 

The Penn Treebank project (1989-1996) produced seven million words tagged for
part-of-speech, three million words of parsed text, over two million words
annotated for predicate-argument structure and 1.6 million words of
transcribed speech annotated for speech disfluencies (Taylor et al., 2003).
Source material represents a diverse range of data, including Wall Street
Journal (WSJ) articles, the Brown Corpus and Switchboard telephone
conversations. 

Penn Treebanks are used for a wide range of purposes, including the creation
and training of parsers and taggers, work on machine translation and speech
recognition, and research concerning joint syntactic and semantic role
labeling. Their ongoing influence is evidenced by the popularity of Treebank-3
(LDC99T42), which continues to be one of LDC’s top ten most distributed
corpora in the Catalog. In addition, the WSJ section has served as a model for
treebanks across many languages (Nivre, 2008).

The Penn Treebank has inspired related annotation schemes, such as Proposition
Bank, the Penn Discourse Treebank project, and word alignment annotation. In
addition, LDC has developed revised English treebank guidelines resulting in
the re-issue of the WSJ section (English News Text Treebank: Penn Treebank
Revised (LDC2015T13)) and treebanked web text (e.g., English Web Treebank
(LDC2012T13) and BOLT English Translation Treebank – Chinese Discussion Forum
(LDC2020T09)).   

Penn Treebank corpora and their related releases are available for licensing to
LDC members and nonmembers. For more information about licensing LDC data,
visit the Obtaining Data page on the LDC website.

--
New publications:
(1)  NUBUC (NyU-BU contextually controlled stories Corpus) was developed by
New York University, Max Planck Institute for Empirical Aesthetics and Boston
University. It contains approximately three hours of English read speech from
eight stories focused on linguistic keywords that were created specifically
for this corpus, along with transcripts, syntactic annotations, and corpus
metadata.

Recordings are 11-12 minutes in duration, for a total of about 90 minutes of
continuous speech per speaker.

NUBUC is distributed via web download.  

2022 Subscription Members will automatically receive copies of this corpus.
2022 Standard Members may request a copy as part of their 16 free membership
corpora. Non-members may license this data at no cost.
*
(2) Samrómur Icelandic Speech 1.0 was developed by the Language and Voice Lab,
Reykjavik University in cooperation with Almannarómur, Center for Language
Technology. The corpus contains 145 hours of Icelandic prompted speech,
comprising 100,000 utterances from 8,392 speakers.

Speech data was collected between October 2019 and May 2021 using the Samrómur
website, which displayed prompts to participants. The prompts were drawn mainly
from The Icelandic Gigaword Corpus, which includes text from novels, news, and
plays, and from a list of location names in Iceland. Additional prompts were
taken from the Icelandic Web of Science, and others were created by pairing a
name with a following question or demand. Prompts and speaker metadata are
included in the corpus.

Samrómur Icelandic Speech 1.0 is distributed via web download.  

2022 Subscription Members will automatically receive copies of this corpus
provided they have submitted a completed copy of the special license
agreement. 2022 Standard Members may request a copy as part of their 16 free
membership corpora. Non-members may license this data for a fee.

Linguistic Field(s): Computational Linguistics

------------------------------------------------------------------------------
