37.224, FYI: January 2026 Newsletter - LDC

The LINGUIST List linguist at listserv.linguistlist.org
Fri Jan 16 23:05:02 UTC 2026


LINGUIST List: Vol-37-224. Fri Jan 16 2026. ISSN: 1069 - 4875.

Subject: 37.224, FYI: January 2026 Newsletter - LDC

Moderator: Steven Moran (linguist at linguistlist.org)
Managing Editor: Valeriia Vyshnevetska
Team: Helen Aristar-Dry, Mara Baccaro, Daniel Swanson
Jobs: jobs at linguistlist.org | Conferences: callconf at linguistlist.org | Pubs: pubs at linguistlist.org

Homepage: http://linguistlist.org

Editor for this issue: Daniel Swanson <daniel at linguistlist.org>

================================================================


Date: 15-Jan-2026
From: Membership Coordinator [ldc at ldc.upenn.edu]
Subject: January 2026 Newsletter - LDC


In this newsletter:
  Renew your LDC membership today
  New publications:
    CALLHOME Japanese Second Edition
    CALLHOME Japanese Lexicon Second Edition
    MATERIAL Swahili-English Language Pack
________________________________________
Renew your LDC membership today
The importance of curated resources for language-related education,
research, and technology development drives LDC’s mission to create
them, to accept data contributions from researchers across the globe,
and to broadly share such resources through the LDC Catalog. LDC
members enjoy no-cost access to new corpora released annually, as well
as the ability to license legacy data sets from among our 1000
holdings at reduced fees. Ensure that your data needs continue to be
met by renewing your LDC membership or by joining the Consortium
today.
Now through March 2, 2026, any organization that joins the Consortium
or renews its membership will receive a 10% discount on the 2026
membership fee. Membership remains the most economical way to access
current and past LDC releases. Consult the Join LDC page for more
details on membership options and benefits.
________________________________________
New publications:
CALLHOME Japanese Second Edition was developed by LDC and contains 49
hours of speech from 120 telephone conversations between native
Japanese speakers. This publication is a re-release of the original
CALLHOME Japanese collection, combining CALLHOME Japanese Speech
(LDC96S37) and CALLHOME Japanese Transcripts (LDC96T18) with
additional transcription and updated directory structure, file
formats, and documentation.
This corpus contains the 120 calls from CALLHOME Japanese Speech that
represented the training and development data and a subset of the
evaluation data. Participants spoke on topics of their choice in a single
telephone call lasting up to 30 minutes. Calls were manually audited
for language, recording quality, channel characteristics, dialect, and
region. For this second edition, all audio was converted from SPHERE
files to FLAC format, and the original training/development/test
partitioning was removed.
This release also features revised transcripts conforming to updated
LDC transcription guidelines that addressed normalization of
annotation formats, standardization of speaker-produced and background
noises, application of foreign-language marking, whitespace cleanup,
and corrections and consistency fixes.
The CALLHOME series consists of telephone conversations and
transcripts developed by LDC and Rutgers, The State University of New
Jersey, in support of research in speaker identification, language
identification, and related technologies. Languages in the series
include American English, Egyptian Arabic, German, Japanese, Mandarin
Chinese, and Spanish.
2026 members can access this corpus through their LDC accounts.
Non-members may license this data for $3000.
*
CALLHOME Japanese Lexicon Second Edition was developed by LDC and
contains 80,688 Japanese words with morphological, phonological, and
stress information. This second edition updates file formats,
directory structure, and documentation. The first edition is available
as CALLHOME Japanese Lexicon (LDC96L17). The words in the lexicon were
derived from 80 transcripts representing telephone conversations
between native Japanese speakers contained in CALLHOME Japanese Second
Edition (LDC2026S02).
The lexicon contains seven tab-separated information fields: (1)
headword: orthographic form in kanji or katakana or hiragana (if only
written in hiragana); (2) hiragana: orthographic form in hiragana; (3)
romanization: orthographic form in romaji; (4) pron: pronunciation of
the headword; (5) morph: morphological analysis of the headword; (6)
train freq: frequency of the headword in the transcripts; and (7)
gloss: glosses of the headword. This release also includes a
pronunciation dictionary derived from the lexicon in CMUdict format
and the grapheme-to-phoneme (G2P) tools used to automatically generate
pronunciations for the original lexicon.
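
As a rough illustration of the seven-field layout described above, the
following Python sketch reads tab-separated lexicon lines into named
records. It is not part of the LDC release: the Python field names, the
UTF-8 encoding, the absence of a header row, the integer frequency, and
the sample line are all assumptions made for this example, so check the
corpus documentation for the actual conventions.

```python
import csv
import io
from typing import Iterable, Iterator, NamedTuple


class LexiconEntry(NamedTuple):
    """One lexicon line; field names mirror the seven fields listed above."""
    headword: str
    hiragana: str
    romanization: str
    pron: str
    morph: str
    train_freq: int  # assumed to be an integer count
    gloss: str


def read_lexicon(lines: Iterable[str]) -> Iterator[LexiconEntry]:
    """Yield an entry for every line that has exactly seven tab-separated fields."""
    # QUOTE_NONE keeps any quote characters in the data as literal text.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != 7:
            continue  # skip blank or malformed lines
        headword, hiragana, romanization, pron, morph, freq, gloss = row
        yield LexiconEntry(headword, hiragana, romanization, pron, morph,
                           int(freq), gloss)


# Invented sample line, for illustration only; not taken from the corpus.
SAMPLE = "猫\tねこ\tneko\tn e k o\tnoun\t3\tcat\n"

entries = list(read_lexicon(io.StringIO(SAMPLE)))
print(entries[0].romanization, entries[0].train_freq)
```

To read a real file, pass an open file handle instead of the in-memory
sample, for example open(path, encoding="utf-8", newline=""), where
path is wherever the lexicon file sits locally.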
2026 members can access this corpus through their LDC accounts
provided they have submitted a completed copy of the special license
agreement. Non-members may license this data for $2250.
*
MATERIAL Swahili-English Language Pack was developed by Appen for the
IARPA MATERIAL program and contains 112 hours of Swahili
conversational telephone speech, transcripts, English translations,
annotations, and queries. Calls were made using different telephones
(e.g., mobile, landline) from a variety of environments. Transcripts
cover approximately 30% of the speech files, 3% of which were
translated into English. This release also includes domain
annotations, English queries, and their relevance annotations.
The MATERIAL program focused on underserved languages, with the
ultimate goal of building cross-language information retrieval systems
that find speech and text content using English search queries.
2026 members can access this corpus through their LDC accounts
provided they have submitted a completed copy of the special license
agreement. Non-members may license this data for $250.
________________________________________
To unsubscribe from this newsletter, log in to your LDC account and
uncheck the box next to “Receive Newsletter” under Account Options or
contact LDC for assistance.

Membership Coordinator
Linguistic Data Consortium
University of Pennsylvania
T: +1-215-573-1275
E: ldc at ldc.upenn.edu
M: 3600 Market St. Suite 810
      Philadelphia, PA 19104

Linguistic Field(s): Computational Linguistics








