37.1082, FYI: March 2026 Newsletter - LDC
The LINGUIST List
linguist at listserv.linguistlist.org
Tue Mar 17 17:05:02 UTC 2026
LINGUIST List: Vol-37-1082. Tue Mar 17 2026. ISSN: 1069 - 4875.
Subject: 37.1082, FYI: March 2026 Newsletter - LDC
Moderator: Steven Moran (linguist at linguistlist.org)
Managing Editor: Valeriia Vyshnevetska
Team: Helen Aristar-Dry, Mara Baccaro, Daniel Swanson
Jobs: jobs at linguistlist.org | Conferences: callconf at linguistlist.org | Pubs: pubs at linguistlist.org
Homepage: http://linguistlist.org
Editor for this issue: Daniel Swanson <daniel at linguistlist.org>
================================================================
Date: 16-Mar-2026
From: Membership Coordinator [ldc at ldc.upenn.edu]
Subject: March 2026 Newsletter - LDC
In this newsletter:
LDC data and commercial technology development
New publications:
  Ancient Chinese WordNet
  CALLHOME Spanish Second Edition
  CALLHOME Spanish Lexicon Second Edition
________________________________________
LDC data and commercial technology development
For-profit organizations are reminded that an LDC membership is a
prerequisite for obtaining a commercial license to almost all LDC
databases. Non-member organizations, including non-member for-profit
organizations, cannot use LDC data to develop or test products for
commercialization, nor can they use LDC data in any commercial product
or for any commercial purpose. LDC data users should consult
corpus-specific license agreements for limitations on the use of
certain corpora. Visit the Licensing page for further information.
________________________________________
New publications:
Ancient Chinese WordNet was developed by Nanjing Normal University and
contains lexical and semantic information for Ancient Chinese
vocabulary from the Pre-Qin period (before 221 BCE). The WordNet
comprises 38,781 word forms and 55,100 senses, each manually linked to
a corresponding synset in Princeton WordNet 1.6 and covering 22 noun
categories, 15 verb categories, and additional adjective and adverb
categories. The Ancient Chinese WordNet project began in 2012 with the
goal of creating a structured lexical database to support linguistic
research and natural language processing applications involving
historical Chinese language materials.
2026 members can access this corpus through their LDC accounts.
Non-members may license this data for a fee.
*
CALLHOME Spanish Second Edition was developed by LDC and contains 38
hours of speech from 120 unscripted telephone conversations between
native Spanish speakers. This publication is a re-release of the
original CALLHOME Spanish collection, combining CALLHOME Spanish
Speech (LDC96S35) and CALLHOME Spanish Transcripts (LDC96T17), with
additional transcription and updated directory structure, file
formats, and documentation.
This corpus contains the 120 calls from CALLHOME Spanish Speech that
made up the training and development data and a subset of the
evaluation data. Participants spoke on topics of their choice in a single
telephone call lasting up to 30 minutes. Calls were manually audited
for language, recording quality, channel characteristics, dialect, and
region. For this second edition, all audio was converted from SPHERE
files to FLAC format, and the original training/development/test
partitioning was removed.
This release also features revised transcripts conforming to updated
LDC transcription guidelines that addressed normalization of
annotation formats, standardization of speaker-produced and background
noises, application of foreign-language marking, whitespace cleanup,
and corrections and consistency fixes.
The CALLHOME series consists of telephone conversations and
transcripts developed by LDC and Rutgers, The State University of New
Jersey, in support of research in speaker identification, language
identification, and related technologies. Languages in the series
include American English, Egyptian Arabic, German, Japanese, Mandarin
Chinese, and Spanish.
2026 members can access this corpus through their LDC accounts.
Non-members may license this data for a fee.
*
CALLHOME Spanish Lexicon Second Edition was developed by LDC and
contains 45,547 Spanish words with morphological, phonological,
stress, and frequency information. This second edition updates file
formats, directory structure, and documentation. The first edition is
available as CALLHOME Spanish Lexicon (LDC96L16). The words in the
lexicon were derived from 80 transcripts representing unscripted
telephone conversations between native Spanish speakers contained in
CALLHOME Spanish Second Edition (LDC2026S04) and from various Spanish
news texts.
The lexicon contains nine tab-separated information fields: (1)
headword: orthographic form; (2) morph: morphological analysis of the
headword; (3) pron: pronunciation of the headword; (4) stress: primary
stress information of the word; (5) callh freq: frequency of the
headword in CALLHOME transcripts; (6) madrid freq: frequency of the
headword in Madrid Radio transcripts; (7) ap freq: frequency of the
headword in Associated Press newswire; (8) reut freq: frequency of the
headword in Reuters newswire; and (9) norte freq: frequency of the
headword in El Norte newswire.
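As an illustration of the nine-field layout described above, here is a
minimal Python sketch of a reader for such a file. It is not part of the
LDC release. The identifier names (LexiconEntry, parse_lexicon, and the
underscore forms of the field names) are our own, and the sketch assumes
one entry per line, no header row, and plain integers in the five
frequency columns. The morph, pron, and stress fields are kept as opaque
strings because the newsletter does not describe their internal notation.
The sample line at the bottom is made up for demonstration and is not
taken from the corpus.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator

# Field order follows the newsletter's list; the Python names are our own.
FIELDS = (
    "headword", "morph", "pron", "stress",
    "callh_freq", "madrid_freq", "ap_freq", "reut_freq", "norte_freq",
)


@dataclass(frozen=True)
class LexiconEntry:
    headword: str      # orthographic form
    morph: str         # morphological analysis (kept as an opaque string)
    pron: str          # pronunciation (kept as an opaque string)
    stress: str        # primary stress information (kept as an opaque string)
    callh_freq: int    # frequency in CALLHOME transcripts
    madrid_freq: int   # frequency in Madrid Radio transcripts
    ap_freq: int       # frequency in Associated Press newswire
    reut_freq: int     # frequency in Reuters newswire
    norte_freq: int    # frequency in El Norte newswire


def parse_lexicon(lines: Iterable[str]) -> Iterator[LexiconEntry]:
    """Yield one LexiconEntry per non-empty line of nine tab-separated fields."""
    for lineno, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        parts = line.split("\t")
        if len(parts) != len(FIELDS):
            raise ValueError(
                f"line {lineno}: expected {len(FIELDS)} fields, got {len(parts)}"
            )
        text_fields, freq_fields = parts[:4], parts[4:]
        # Assumption: the five frequency columns hold plain integers.
        yield LexiconEntry(*text_fields, *(int(f) for f in freq_fields))


if __name__ == "__main__":
    # Made-up sample line with placeholder values, not real corpus data.
    sample = ["casa\tMORPH\tPRON\tSTRESS\t3\t0\t12\t7\t1\n"]
    for entry in parse_lexicon(sample):
        print(entry.headword, entry.callh_freq)
```

To read an actual file, pass an open file object to parse_lexicon. The
file name and text encoding are not stated in the newsletter, so consult
the corpus documentation for both.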
This release also includes a pronunciation dictionary derived from the
lexicon in CMUdict format and the grapheme-to-phoneme (G2P) tools used
to automatically generate pronunciations for the original lexicon.
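For readers unfamiliar with the CMUdict format, the following Python
sketch shows one way to load a dictionary in that general style: one
entry per line, a headword followed by whitespace-separated phone
symbols, comment lines beginning with ";;;", and alternate
pronunciations marked with a numeric suffix such as "(2)". Whether the
dictionary in this release uses comments or variant suffixes is an
assumption, and the function name parse_cmudict is our own. The phone
symbols in the demonstration lines are placeholders, not the phone set
used by LDC.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List

# Matches a variant suffix such as "(2)" at the end of a headword.
_VARIANT = re.compile(r"\(\d+\)$")


def parse_cmudict(lines: Iterable[str]) -> Dict[str, List[List[str]]]:
    """Map each headword to a list of pronunciations (lists of phone symbols)."""
    prons: Dict[str, List[List[str]]] = defaultdict(list)
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith(";;;"):
            continue  # skip blank lines and comment lines
        word, *phones = line.split()
        if not phones:
            continue  # skip lines that carry a headword but no pronunciation
        # Fold "word(2)" into "word" so variants share one key.
        prons[_VARIANT.sub("", word)].append(phones)
    return dict(prons)


if __name__ == "__main__":
    # Made-up lines with placeholder phone symbols.
    demo = [";;; comment", "casa  k a s a", "casa(2)  k a z a"]
    print(parse_cmudict(demo))
```

Storing a list of pronunciations per headword keeps alternate variants
together instead of letting a later line overwrite an earlier one.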
2026 members can access this corpus through their LDC accounts
provided they have submitted a completed copy of the special license
agreement. Non-members may license this data for a fee.
To unsubscribe from this newsletter, log in to your LDC account and
uncheck the box next to “Receive Newsletter” under Account Options or
contact LDC for assistance.
Membership Coordinator
Linguistic Data Consortium
University of Pennsylvania
T: +1-215-573-1275
E: ldc at ldc.upenn.edu
M: 3600 Market St. Suite 810
Philadelphia, PA 19104
Linguistic Field(s): Computational Linguistics
------------------------------------------------------------------------------
********************** LINGUIST List Support ***********************
Please consider donating to the LINGUIST List, a U.S. 501(c)(3) not-for-profit organization:
https://www.paypal.com/donate/?hosted_button_id=87C2AXTVC4PP8
LINGUIST List is supported by the following publishers:
Bloomsbury Publishing http://www.bloomsbury.com/uk/
Cambridge University Press http://www.cambridge.org/linguistics
Cascadilla Press http://www.cascadilla.com/
De Gruyter Brill https://www.degruyterbrill.com/?changeLang=en
Edinburgh University Press http://www.edinburghuniversitypress.com
European Language Resources Association (ELRA) http://www.elra.info
John Benjamins http://www.benjamins.com/
Language Science Press http://langsci-press.org
Lincom GmbH https://lincom-shop.eu/
MIT Press http://mitpress.mit.edu/
Multilingual Matters http://www.multilingual-matters.com/
Narr Francke Attempto Verlag GmbH + Co. KG http://www.narr.de/
Netherlands Graduate School of Linguistics / Landelijke (LOT) http://www.lotpublications.nl/
Peter Lang AG http://www.peterlang.com
SIL International Publications http://www.sil.org/resources/publications