35.1533, FYI: May 2024 Newsletter - LDC

Fri May 17 16:05:07 UTC 2024

LINGUIST List: Vol-35-1533. Fri May 17 2024. ISSN: 1069 - 4875.

Subject: 35.1533, FYI: May 2024 Newsletter - LDC

Moderators: Malgorzata E. Cavar, Francis Tyers (linguist at linguistlist.org)
Managing Editor: Justin Fuller
Team: Helen Aristar-Dry, Steven Franks, Everett Green, Daniel Swanson, Maria Lucero Guillen Puon, Zackary Leech, Lynzie Coburn, Natasha Singh, Erin Steitz
Jobs: jobs at linguistlist.org | Conferences: callconf at linguistlist.org | Pubs: pubs at linguistlist.org

Homepage: http://linguistlist.org

Please support the LL editors and operation with a donation at:
           https://funddrive.linguistlist.org/donate/

Editor for this issue: Steven Moran <steve at linguistlist.org>

LINGUIST List is hosted by Indiana University College of Arts and Sciences.
================================================================

Date: 16-May-2024
From: Membership Coordinator [ldc at ldc.upenn.edu]
Subject: May 2024 Newsletter - LDC

In this newsletter:
LDC at LREC-COLING 2024

New publications:
Call My Net 1
Automatic Content Extraction for Portuguese
________________________________________
LDC at LREC-COLING 2024
LDC will be exhibiting at LREC-COLING 2024 hosted by the European
Language Resources Association (ELRA) and the International Committee
on Computational Linguistics (ICCL) May 20-25 in Turin, Italy. Stop by
our table to learn more about recent developments at the Consortium
and the latest publications.

LDC staff members will also be presenting current work on topics
including Spanless Event Annotation for Corpus-Wide Complex Event
Understanding; Schema Learning Corpus: Data and Annotation Focused on
Complex Events; and KoFREN: Comprehensive Korean Word Frequency Norms
Derived from Large Scale Free Speech Corpora.

LDC will post conference updates via social media. We look forward to
seeing you in Italy!
________________________________________

New publications:
Call My Net 1 was developed by LDC and contains 364 hours of
conversational telephone speech in four languages (Tagalog, Cebuano,
Cantonese, and Mandarin) collected in 2015 from 221 native speakers
located in the Philippines and China along with metadata and speaker
demographic information. Recordings and data from this collection were
used to support the NIST 2016 Speaker Recognition Evaluation.

Speakers made 10 telephone calls each to people within their existing
social networks, using different handsets and under a variety of noise
conditions. Speakers were connected through a robot operator to carry
on casual conversations on topics of their choice. All recordings were
manually audited to confirm language and speaker requirements. The
documentation for this release includes metadata about phone type,
noise conditions, and call quality. Speaker demographic information on
year of birth, sex, and native language is also included.

2024 members can access this corpus through their LDC accounts.
Non-members may license this data for a fee.

*

Automatic Content Extraction for Portuguese was developed at INESC TEC
- Instituto de Engenharia de Sistemas e Computadores, Tecnologia e
Ciência and consists of automatic Brazilian Portuguese and European
Portuguese translations of the English text and annotations in ACE
2005 Multilingual Training Corpus (LDC2006T06).

ACE 2005 Multilingual Training Corpus was developed by LDC to support
the Automatic Content Extraction (ACE) program, specifically by
providing training data for the 2005 technology evaluation. It
contains 1,800 files of mixed genre text in Arabic, English, and
Chinese annotated for entities, relations, and events. The objective
of the ACE program was to develop automatic content extraction
technology to support automatic processing of human language in text
form. Text genres included newswire, broadcast news, broadcast
conversation, weblog, discussion forums, and conversational telephone
speech.

For this translation, the English data was partitioned into training,
development, and test sets. The documents were split into sentences
and each event mention was assigned to its sentence. Source sentences
and their annotations were translated into Brazilian Portuguese using
Google Translate and into European Portuguese using DeepL Translate.
An alignment algorithm and a parallel corpus word aligner were used to
handle mismatches between translated annotations and their translated
sentences.
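
The sentence-splitting step above, in which each event mention is
assigned to its sentence, can be illustrated with a short sketch. This
is not the corpus developers' code, which the newsletter does not
describe; it is a minimal example that assumes mentions are stored as
character offsets into the document text. The names Mention,
sentence_spans, and assign_mentions, and the sample text, are
hypothetical.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass
class Mention:
    start: int  # offset of the first character
    end: int    # offset one past the last character
    text: str


def sentence_spans(doc, sentences):
    """Locate each sentence in doc and return its (start, end) offsets."""
    spans, cursor = [], 0
    for sent in sentences:
        start = doc.index(sent, cursor)
        spans.append((start, start + len(sent)))
        cursor = start + len(sent)
    return spans


def assign_mentions(spans, mentions):
    """Map sentence index -> mentions with sentence-local offsets."""
    starts = [start for start, _ in spans]
    assigned = {i: [] for i in range(len(spans))}
    for m in mentions:
        # Last sentence whose start is at or before the mention start.
        i = bisect_right(starts, m.start) - 1
        if i < 0:
            continue
        s_start, s_end = spans[i]
        # Skip mentions that cross a sentence boundary (assumption:
        # such cases would need separate handling).
        if m.end <= s_end:
            local = Mention(m.start - s_start, m.end - s_start, m.text)
            assigned[i].append(local)
    return assigned


# Hypothetical sample data, not taken from the corpus.
doc = "Troops attacked the town. The mayor resigned on Monday."
sentences = ["Troops attacked the town.", "The mayor resigned on Monday."]
mentions = [Mention(7, 15, "attacked"), Mention(36, 44, "resigned")]

assigned = assign_mentions(sentence_spans(doc, sentences), mentions)
```

Re-basing the offsets to each sentence keeps an annotation attached to
its sentence when sentences are translated one at a time. Realigning
the translated annotation with the translated sentence is a separate
step and is not shown here.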

2024 members can access this corpus through their LDC accounts.
Non-members may license this data for a fee.

To unsubscribe from this newsletter, log in to your LDC account and
uncheck the box next to “Receive Newsletter” under Account Options or
contact LDC for assistance.

Membership Coordinator
Linguistic Data Consortium
University of Pennsylvania
T: +1-215-573-1275
E: ldc at ldc.upenn.edu
M: 3600 Market St. Suite 810
      Philadelphia, PA 19104

Linguistic Field(s): Computational Linguistics

------------------------------------------------------------------------------

Please consider donating to the Linguist List https://give.myiu.org/iu-bloomington/I320011968.html

LINGUIST List is supported by the following publishers:

Cambridge University Press http://www.cambridge.org/linguistics

De Gruyter Mouton https://cloud.newsletter.degruyter.com/mouton

Equinox Publishing Ltd http://www.equinoxpub.com/

John Benjamins http://www.benjamins.com/

Lincom GmbH https://lincom-shop.eu/

Multilingual Matters http://www.multilingual-matters.com/

Narr Francke Attempto Verlag GmbH + Co. KG http://www.narr.de/

Wiley http://www.wiley.com
