From Amir.Zeldes at georgetown.edu Thu Mar 13 20:36:27 2025
From: Amir.Zeldes at georgetown.edu (Amir.Zeldes at georgetown.edu)
Date: Thu, 13 Mar 2025 16:36:27 -0400
Subject: [Rstlist] GUM Corpus V11 - new documents and annotations
Message-ID: <008b01db9457$98447350$c8cd59f0$@georgetown.edu>
(Apologies for cross-postings)

*** The GUM Corpus - Release 11.0.0 ***
*** Georgetown University Multilayer corpus ***

Corpling at GU is happy to announce the first release of series 11 of the Georgetown University Multilayer corpus (GUM V11.0.0):

https://gucorpling.org/gum/

New in this version:

* GUM and the out-of-domain test set GENTLE have now merged!
* New documents - the corpus now contains 268,208 tokens
* Five different summaries per document
* Graded salience scores (0-5) for each entity in every document

GUM is an open source corpus of richly annotated English texts from 24 genres:

* Main genres (available in train/dev/test):
  * academic writing
  * biographies
  * courtroom transcripts
  * essays
  * fiction
  * how-to guides
  * interviews
  * letters
  * news
  * online forum discussions
  * podcasts
  * political speeches
  * spontaneous face to face conversations
  * textbooks
  * travel guides
  * vlogs

* Out-of-domain test genres (test2, aka GENTLE partition):
  * dictionary entries
  * live esports commentary
  * legal documents
  * medical notes
  * poetry
  * mathematical proofs
  * course syllabuses
  * threat letters

The corpus is created by students as part of the Computational Linguistics curriculum at Georgetown University and is available under Creative Commons licenses.

This is the first version of GUM series 11, containing roughly 281 documents annotated for:

* Multiple POS tags (100% manual gold PTB, extended PTB, converted CLAWS5 and UPOS) and UD morphological features
* Manually corrected lemmatization and morphological segmentation
* Sentence segmentation and rough speech act (manual)
* Document structure using TEI tags (paragraphs, headings, figures, captions etc., all manual)
* Constituent and dependency syntax (manually corrected Universal Dependencies, and PTB parses from gold tags with function labels and enhanced dependencies)
* Construction Grammar annotations following UCxn
* Information status (given-active/inactive, accessible-inferable/common ground/aggregate, and new)
* Entity type, graded salience (0-5) and coreference annotation (including non-named entities, singletons, appositions, cataphora and several types of bridging), as well as Centering Theory annotations
* Entity linking (Wikification) of all named entities with Wikipedia articles, including their non-named and pronominal mentions
* Discourse parses in enhanced Rhetorical Structure Theory (eRST) and discourse dependencies, including multiple concurrent and non-projective relations
* Discourse signal annotations classified into 9 major and 45 minor types indicating how the presence of a relation is marked (based on the Signaling Corpus scheme)
* Shallow discourse relations following the PDTB v3 scheme
* Five abstractive summaries for each document following strict, comparable guidelines across genres

Note on Reddit data: token text is not contained in the release but can be downloaded with an included script.

For more information and to search or download the corpus online, see the corpus website: https://gucorpling.org/gum/

Best wishes,
The GUM team

PS - if you like GUM, also check out our automatically annotated AMALGUM corpus!