35.2865, FYI: October 2024 Newsletter - LDC
The LINGUIST List
linguist at listserv.linguistlist.org
Wed Oct 16 05:05:02 UTC 2024
LINGUIST List: Vol-35-2865. Wed Oct 16 2024. ISSN: 1069 - 4875.
Subject: 35.2865, FYI: October 2024 Newsletter - LDC
Moderator: Steven Moran (linguist at linguistlist.org)
Managing Editor: Justin Fuller
Team: Helen Aristar-Dry, Steven Franks, Joel Jenkins, Daniel Swanson, Erin Steitz
Jobs: jobs at linguistlist.org | Conferences: callconf at linguistlist.org | Pubs: pubs at linguistlist.org
Homepage: http://linguistlist.org
Editor for this issue: Joel Jenkins <joel at linguistlist.org>
================================================================
Date: 15-Oct-2024
From: Membership Coordinator [ldc at ldc.upenn.edu]
Subject: October 2024 Newsletter - LDC
In this newsletter:
Membership year 2025 publication preview
New publications:
RST Continuity Corpus
MultiTACRED
Membership year 2025 publication preview:
The 2025 membership year is approaching and plans for next year’s
publications are in progress. Among the expected releases are:
• Iraqi Arabic – English Lexical Database: a set of six interrelated
tables (roots, lemmas, wordforms, multi-word expressions, English
definitions, example phrases) presenting each Iraqi Arabic word in
Arabic script and IPA format.
• AIDA topic source data and annotations: multimodal source data and
annotations in multiple languages for information and entity
extraction.
• 2015 NIST Language Recognition Evaluation Test Set: 164,000+
segments of conversational telephone speech and broadcast narrow band
speech in six linguistic varieties representing 20 languages, used in
NIST’s 2015 language recognition evaluation.
• BOLT CALLFRIEND CALLHOME CTS Audio, Transcripts and Translations:
previously unpublished Chinese and Egyptian Arabic telephone
conversations from the CALLFRIEND and CALLHOME collections, with
transcripts and translations developed by LDC for the DARPA BOLT
program.
• Chinese Sentence Pattern Structure Treebank: 5,000+ sentences from
ancient and modern Chinese texts with syntactic annotation based on
sentence constituent analysis, developed by Beijing Normal University
and Peking University.
• IARPA MATERIAL language packs: conversational telephone speech,
transcripts, English translations, annotations, and queries in
multiple languages.
• LORELEI: representative and incident language packs containing
monolingual text, bi-text, translations, annotations, supplemental
resources, and related tools.
New publications:
RST Continuity Corpus was developed at Åbo Akademi University and
Humboldt-Universität zu Berlin and contains annotations for continuity
dimensions added to RST Discourse Treebank (LDC2002T07). RST Discourse
Treebank is a collection of English news texts from the Penn Treebank
annotated for rhetorical relations under the RST (Rhetorical Structure
Theory) framework. In RST Continuity Corpus, the relations are
annotated for the seven continuity dimensions: time, space, reference,
action, perspective, modality, and speech act. The relations are also
annotated for polarity, order of segments, nuclearity, and context.
2024 members can access this corpus through their LDC accounts.
Non-members may license this data for a fee.
*
MultiTACRED was developed by the German Research Center for Artificial
Intelligence (DFKI) Speech and Language Technology Lab. It is a
machine translation of the TAC Relation Extraction Dataset (TACRED,
LDC2018T24) into twelve languages with projected entity annotations.
TACRED is a large-scale relation extraction dataset containing 106,264
examples built over English newswire and web text used in the NIST TAC
KBP English slot filling evaluations during the period 2009-2014. The
training and evaluation data for the TAC KBP slot filling tasks was
developed by the Linguistic Data Consortium.
TACRED training, development, and test splits were translated into
Arabic, Chinese, Finnish, French, German, Hindi, Hungarian, Japanese,
Polish, Russian, Spanish, and Turkish using DeepL or Google Translate.
The test split was back-translated into English to generate
machine-translated English test data. TACRED annotations are specified
by token offsets. For translation, tokens were concatenated with
whitespace, and the entity offsets were converted into XML-style
markers that denote the relation arguments.
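The offset-to-marker conversion described above can be pictured with a
short Python sketch. It is only an illustration under stated
assumptions, not the MultiTACRED tooling: the function names, the
<H>/<T> tag names, and the inclusive end offsets are all hypothetical
choices made for this example.

```python
def mark_arguments(tokens, head_span, tail_span):
    """Join tokens with spaces and wrap two token spans in XML-style markers.

    Spans are (start, end) token indices; the end index is treated as
    inclusive, which is an assumption of this sketch.
    """
    (h_start, h_end), (t_start, t_end) = head_span, tail_span
    out = []
    for i, tok in enumerate(tokens):
        if i == h_start:
            out.append("<H>")   # hypothetical tag for the first argument
        if i == t_start:
            out.append("<T>")   # hypothetical tag for the second argument
        out.append(tok)
        if i == h_end:
            out.append("</H>")
        if i == t_end:
            out.append("</T>")
    return " ".join(out)


def recover_spans(marked_text):
    """Split marked text on whitespace and read the spans back as token offsets."""
    tokens, spans, open_at = [], {}, {}
    for piece in marked_text.split():
        if piece in ("<H>", "<T>"):
            open_at[piece[1]] = len(tokens)
        elif piece in ("</H>", "</T>"):
            tag = piece[2]
            spans[tag] = (open_at[tag], len(tokens) - 1)
        else:
            tokens.append(piece)
    return tokens, spans


tokens = ["Ada", "Lovelace", "was", "born", "in", "London", "."]
marked = mark_arguments(tokens, (0, 1), (5, 5))
# marked == "<H> Ada Lovelace </H> was born in <T> London </T> ."
```

Because the markers travel through the translation step as ordinary
text, reading them back with something like recover_spans yields
argument offsets in the translated sentence. The sketch assumes the
markers stay separated from neighbouring words by whitespace, which a
real translation system may not guarantee.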
2024 members can access this corpus through their LDC accounts.
Non-members may license this data for a fee.
Membership Coordinator
Linguistic Data Consortium
University of Pennsylvania
T: +1-215-573-1275
E: ldc at ldc.upenn.edu
M: 3600 Market St. Suite 810
Philadelphia, PA 19104
Linguistic Field(s): Computational Linguistics