36.1570, FYI: May 2025 Newsletter - LDC
The LINGUIST List
linguist at listserv.linguistlist.org
Fri May 16 09:05:02 UTC 2025
LINGUIST List: Vol-36-1570. Fri May 16 2025. ISSN: 1069 - 4875.
Subject: 36.1570, FYI: May 2025 Newsletter - LDC
Moderator: Steven Moran (linguist at linguistlist.org)
Managing Editor: Justin Fuller
Team: Helen Aristar-Dry, Steven Franks, Joel Jenkins, Daniel Swanson, Erin Steitz
Jobs: jobs at linguistlist.org | Conferences: callconf at linguistlist.org | Pubs: pubs at linguistlist.org
Homepage: http://linguistlist.org
Editor for this issue: Joel Jenkins <joel at linguistlist.org>
================================================================
Date: 15-May-2025
From: Membership Coordinator [ldc at ldc.upenn.edu]
Subject: May 2025 Newsletter - LDC
In this newsletter:
New publications:
BOLT CTS CALLFRIEND CALLHOME Mandarin Chinese Audio
BOLT CTS CALLFRIEND CALLHOME Mandarin Chinese Transcripts and Translations
________________________________________
New publications:
BOLT CTS CALLFRIEND CALLHOME Mainland Mandarin Chinese Audio was
developed by LDC and consists of 93 hours of speech from 236
unscripted telephone conversations between native speakers of the
Mandarin Chinese dialect spoken in mainland China. The calls were
collected by LDC in the CALLFRIEND and CALLHOME series, in which
participants called family members or close friends and spoke on
topics of their choice. Around 60% of the recordings (141 calls) are
publicly released for the first time. The remaining 95 recordings were
previously published by LDC in various CALLFRIEND, CALLHOME, and HUB5
Mandarin datasets. The data is divided into training, development, and
evaluation partitions.
The DARPA BOLT (Broad Operational Language Translation) program
developed machine translation and information retrieval for less
formal genres, focusing particularly on user-generated content. LDC
supported the BOLT program by collecting informal data sources --
discussion forums, conversational telephone speech, text messaging,
and chat -- in Chinese, Egyptian Arabic, and English. The material in
this release represents the unannotated Chinese source conversational
telephone speech. The telephone data was transcribed, translated, and
annotated for various tasks in the BOLT program, including word
alignment, treebanking, and co-reference.
2025 members can access this corpus through their LDC accounts.
Non-members may license this data for a fee.
*
BOLT CTS CALLFRIEND CALLHOME Mainland Mandarin Chinese Transcripts and
Translations contains transcripts and corresponding English
translations for the conversational telephone speech in BOLT CTS
CALLFRIEND CALLHOME Mandarin Chinese Audio and was developed by LDC to
support the DARPA BOLT program.
Transcribers were required to produce a verbatim transcript of all
speech within a file using simplified Chinese orthography and to add
minimal markup to capture salient features of the speech. Some
transcripts include redactions for potential personally identifying
information. All speech data was transcribed and is divided into
training, development, and evaluation partitions.
The goal of the BOLT translation task was to translate the Chinese
transcripts into fluent English while preserving the meaning present
in the original Chinese text. Transcripts in the development and
evaluation partitions received first-pass and gold-standard
translations. Overall, 89% of the transcripts were translated into
English.
2025 members can access this corpus through their LDC accounts.
Non-members may license this data for a fee.
Membership Coordinator
Linguistic Data Consortium
University of Pennsylvania
T: +1-215-573-1275
E: ldc at ldc.upenn.edu
M: 3600 Market St. Suite 810
Philadelphia, PA 19104
Linguistic Field(s): Computational Linguistics