37.1785, FYI: May 2026 Newsletter - LDC

The LINGUIST List linguist at listserv.linguistlist.org
Fri May 15 17:05:02 UTC 2026


LINGUIST List: Vol-37-1785. Fri May 15 2026. ISSN: 1069 - 4875.

Subject: 37.1785, FYI: May 2026 Newsletter - LDC

Moderator: Steven Moran (linguist at linguistlist.org)
Managing Editor: Valeriia Vyshnevetska
Team: Helen Aristar-Dry, Mara Baccaro, Daniel Swanson
Jobs: jobs at linguistlist.org | Conferences: callconf at linguistlist.org | Pubs: pubs at linguistlist.org

Homepage: http://linguistlist.org

Editor for this issue: Daniel Swanson <daniel at linguistlist.org>

================================================================


Date: 15-May-2026
From: Membership Coordinator [ldc at ldc.upenn.edu]
Subject: May 2026 Newsletter - LDC


In this newsletter:
New publications:
MADCAT Phases 1-3 Composite Evaluation Set
CALLHOME German Second Edition
CALLHOME German Lexicon Second Edition
________________________________________
New publications:
MADCAT Phases 1-3 Composite Evaluation Set contains the evaluation
data created by LDC for Phases 1-3 of the DARPA MADCAT program and the
NIST OpenHaRT 2010 and 2013 evaluations. It consists of handwritten
Arabic documents scanned at high resolution and annotated with the
physical coordinates of each line and token, digital transcripts, and
English translations. Content and annotation layers are integrated in
a single MADCAT XML output.
This release includes 1,643 images and corresponding annotation files.
Source documents were web text and newswire collected by LDC.
Arabic-speaking scribes copied documents by hand, following specific
instructions as to the writing style, writing implement, and paper.
Each page was scanned and the images annotated.
The goal of the MADCAT program was to automatically convert
foreign-language text images into English transcripts for use by humans and
downstream processes, including summarization and information
extraction. The core evaluation task in MADCAT was the translation of
handwritten Arabic documents.
2026 members can access this corpus through their LDC accounts.
Non-members may license this data for a fee.
*
CALLHOME German Second Edition was developed by LDC and contains 48
hours of speech from 100 unscripted telephone conversations between
native German speakers. This publication is a re-release of the
original CALLHOME German collection, combining CALLHOME German Speech
(LDC97S43) and CALLHOME German Transcripts (LDC97T15), with additional
transcription and updated directory structure, file formats, and
documentation.
This release contains the 100 telephone conversations published in
CALLHOME German Speech, which comprised training data (80 calls) and
development data (20 calls). Participants spoke on topics of their
choice in a single telephone call lasting up to 30 minutes. Calls were
manually audited for language, recording quality, channel
characteristics, dialect, and region. For this second edition, all
audio was converted from SPHERE files to FLAC format, and the original
training/development partitioning was removed.
This release also features revised transcripts conforming to updated
LDC transcription guidelines that addressed normalization of
annotation formats, standardization of speaker-produced and background
noises, application of foreign-language marking, whitespace cleanup,
and corrections and consistency fixes.
The CALLHOME series consists of telephone conversations and
transcripts developed by LDC and Rutgers, The State University of New
Jersey, in support of research in speaker identification, language
identification, and related technologies. Languages in the series
include American English, Egyptian Arabic, German, Japanese, Mandarin
Chinese, and Spanish.
2026 members can access this corpus through their LDC accounts.
Non-members may license this data for a fee.
*
CALLHOME German Lexicon Second Edition was developed by LDC and
contains 318,809 German words with morphological, phonological,
stress, and frequency information. This second edition updates file
formats, directory structure, and documentation. The first edition is
available as CALLHOME German Lexicon (LDC97L18).
The words in the lexicon were derived from the CELEX German lexicon
(CELEX2, LDC96L14) and from the 100 training and development
transcripts of unscripted telephone conversations between native German
speakers contained in CALLHOME German Second Edition (LDC2026S04).
The lexicon has seven tab-separated information fields: (1) headword:
orthographic form; (2) morph: morphological analysis of the headword;
(3) pron: pronunciation of the headword; (4) stress: primary stress
information of the word; (5) celex: whether the headword appears in
the CELEX German lexicon; (6) train_freq: frequency of the headword in
the CALLHOME training transcripts; and (7) dev_freq: frequency of the
headword in the CALLHOME development transcripts. This release also
includes a pronunciation dictionary derived from the lexicon in
CMUdict format.
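As a rough illustration of the seven-field layout described above, the
following Python sketch reads tab-separated lines into records. The
field order follows the list in this announcement. The sample line,
the LexiconEntry class, the parse_lexicon function, and the assumption
that both frequency fields hold plain integers are hypothetical and
are not taken from the LDC documentation.

```python
import io
from dataclasses import dataclass

# Field order as listed in the announcement.
FIELDS = ("headword", "morph", "pron", "stress",
          "celex", "train_freq", "dev_freq")


@dataclass
class LexiconEntry:
    headword: str
    morph: str
    pron: str
    stress: str
    celex: str
    train_freq: int  # assumed to be an integer count
    dev_freq: int    # assumed to be an integer count


def parse_lexicon(stream):
    """Yield one LexiconEntry per line that has exactly seven tab-separated fields."""
    for line in stream:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != len(FIELDS):
            continue  # skip blank or malformed lines
        record = dict(zip(FIELDS, parts))
        record["train_freq"] = int(record["train_freq"])
        record["dev_freq"] = int(record["dev_freq"])
        yield LexiconEntry(**record)


# Hypothetical sample line; the values are invented for illustration only.
sample = "Haus\tHaus\thaUs\t'\tY\t12\t3\n"
entries = list(parse_lexicon(io.StringIO(sample)))
print(entries[0].headword, entries[0].train_freq + entries[0].dev_freq)
```

If the real files encode the celex or frequency fields differently,
the two int() conversions are the only lines that would need to change.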
2026 members can access this corpus through their LDC accounts
provided they have submitted a completed copy of the special license
agreement. Non-members may license this data for a fee.
To unsubscribe from this newsletter, log in to your LDC account and
uncheck the box next to “Receive Newsletter” under Account Options or
contact LDC for assistance.
Membership Coordinator
Linguistic Data Consortium
University of Pennsylvania
T: +1-215-573-1275
E: ldc at ldc.upenn.edu
M: 3600 Market St. Suite 810
      Philadelphia, PA 19104

Linguistic Field(s): Computational Linguistics



