34.26, FYI: December 2022 Newsletter - LDC

Fri Jan 6 09:05:02 UTC 2023

LINGUIST List: Vol-34-26. Fri Jan 06 2023. ISSN: 1069 - 4875.

Subject: 34.26, FYI: December 2022 Newsletter - LDC

Moderator: Malgorzata E. Cavar, Francis Tyers (linguist at linguistlist.org)
Managing Editor: Lauren Perkins
Team: Helen Aristar-Dry, Steven Franks, Everett Green, Sarah Robinson, Joshua Sims, Jeremy Coburn, Daniel Swanson, Matthew Fort, Maria Lucero Guillen Puon, Billy Dickson
Jobs: jobs at linguistlist.org | Conferences: callconf at linguistlist.org | Pubs: pubs at linguistlist.org

Homepage: http://linguistlist.org

Please support the LL editors and operation with a donation at:
           https://funddrive.linguistlist.org/donate/

Editor for this issue: Everett Green <everett at linguistlist.org>
================================================================

Date: 
From: Membership Coordinator [ldc at ldc.upenn.edu]
Subject: December 2022 Newsletter - LDC

In this newsletter:
LDC 2023 membership discounts now available
Approaching deadline for Spring 2023 data scholarship applications
30th Anniversary Highlight: AMR
________________________________________

New publications:
CAMIO Transcription Languages
Global TIMIT Thai
Third DIHARD Challenge Evaluation

LDC 2023 membership discounts now available
Now through March 1, 2023, current 2022 members receive a 10% discount
for renewing their membership, and new or returning organizations
receive a 5% discount. Membership remains the most economical way to
access current and past LDC releases. Consult the Join LDC page for
details on membership options and benefits.

Approaching deadline for Spring 2023 data scholarship applications
Attention students: don’t miss out on the chance to receive no-cost
access to LDC data for your research. Applications for Spring 2023
data scholarships are due January 15, 2023. For more information on
requirements and program rules, see the LDC Data Scholarships page.

________________________________________

New publications:
CAMIO Transcription Languages was developed by LDC and contains nearly
70,000 images of machine printed text with corresponding annotations
and transcripts in 13 languages: Arabic, Chinese, English, Farsi,
Hindi, Japanese, Kannada, Korean, Russian, Tamil, Thai, Urdu, and
Vietnamese. This corpus is a subset of data created for a broader
effort to support the development and evaluation of optical character
recognition and related technologies for 35 languages across 24 unique
script types.

Most images were annotated for text localization, resulting in over
2.3M line-level bounding boxes; 1250 images per language were also
annotated with orthographic transcriptions of each line plus
specification of reading order, yielding over 2.4M tokens of
transcribed text. The resulting annotations are represented in an XML
output format defined for this corpus. Data for each language is
partitioned into test, train, and validation sets.

2022 members can access this corpus through their LDC accounts.
Non-members may license this data for a fee.
*
Global TIMIT Thai consists of 112 hours of read speech and
time-aligned transcripts in Standard Thai from 50 speakers (33 female,
17 male) reading 120 sentences selected from the Thai National Corpus,
the Thai Junior Encyclopedia, and Thai Wikipedia, for a total of 6000
utterances. Data was collected in 2016. Speakers were recruited in the
Bangkok metropolitan area; they were native Thais, fluent in Standard
Thai, and literate.

This data set was developed as part of LDC’s Global TIMIT project,
which aims to create a series of corpora in a variety of languages
with key features similar to those of the original TIMIT
Acoustic-Phonetic Continuous Speech Corpus (LDC93S1), which was
designed for acoustic-phonetic studies and for the development and
evaluation of automatic speech recognition systems.

2022 members can access this corpus through their LDC accounts.
Non-members may license this data for a fee.
*
Third DIHARD Challenge Evaluation was developed by LDC and contains 33
hours of English and Chinese speech data along with corresponding
annotations used in support of the Third DIHARD Challenge.

The Third DIHARD development and evaluation sets were drawn from
diverse sources including monologues, map task dialogues, broadcast
interviews, sociolinguistic interviews, meeting speech, speech in
restaurants, clinical recordings, and amateur web videos. Annotations
include diarization and segmentation.

2022 members can access this corpus through their LDC accounts.
Non-members may license this data for a fee.

To unsubscribe from this newsletter, log in to your LDC account and
uncheck the box next to “Receive Newsletter” under Account Options; or
contact LDC for assistance.

Membership Coordinator
Linguistic Data Consortium
University of Pennsylvania
T: +1-215-573-1275
E: ldc at ldc.upenn.edu
M: 3600 Market St. Suite 810
Philadelphia, PA 19104

Linguistic Field(s): Computational Linguistics
