[Rstlist] GUM Corpus V11 - new documents and annotations

Amir.Zeldes at georgetown.edu
Thu Mar 13 20:36:27 UTC 2025


(Apologies for cross-postings)

*** The GUM Corpus - Release 11.0.0 ***

*** Georgetown University Multilayer corpus ***

Corpling at GU <https://gucorpling.org/corpling/> is happy to announce the first release of series 11 of the Georgetown University Multilayer corpus (GUM V11.0.0):

https://gucorpling.org/gum/

New in this version: 

*	GUM and the out-of-domain test set GENTLE have now merged!
*	New documents – the corpus now contains 268,208 tokens
*	Five different summaries per document
*	Graded salience scores (0-5) for each entity in every document

GUM is an open source corpus of richly annotated English texts from 24 genres: 

*	Main genres (available in train/dev/test):

	*	academic writing
	*	biographies
	*	courtroom transcripts
	*	essays
	*	fiction
	*	how-to guides
	*	interviews
	*	letters
	*	news
	*	online forum discussions
	*	podcasts
	*	political speeches
	*	spontaneous face-to-face conversations
	*	textbooks
	*	travel guides
	*	vlogs

*	Out-of-domain test genres (test2, aka the GENTLE partition):

	*	dictionary entries
	*	live esports commentary
	*	legal documents
	*	medical notes
	*	poetry
	*	mathematical proofs
	*	course syllabuses
	*	threat letters

The corpus is created by students as part of the Computational Linguistics curriculum at Georgetown University and is available under Creative Commons licenses.

This is the first version of GUM series 11, containing roughly 281 documents annotated for:

*	Multiple POS tags (100% manual gold PTB, extended PTB, converted CLAWS5 and UPOS) and UD morphological features
*	Manually corrected lemmatization and morphological segmentation
*	Sentence segmentation and rough speech act (manual)
*	Document structure using TEI tags (paragraphs, headings, figures, captions etc., all manual)
*	Constituent and dependency syntax (manually corrected Universal Dependencies, and PTB parses from gold tags with function labels and enhanced dependencies)
*	Construction Grammar annotations following UCxn
*	Information status (given-active/inactive, accessible-inferable/common ground/aggregate, and new)
*	Entity type, graded salience (0-5) and coreference annotation (including non-named entities, singletons, appositions, cataphora and several types of bridging), as well as Centering Theory annotations
*	Entity linking (Wikification) of all named entities with Wikipedia articles, including their non-named and pronominal mentions
*	Discourse parses in enhanced Rhetorical Structure Theory (eRST) and discourse dependencies, including multiple concurrent and non-projective relations
*	Discourse signal annotations classified into 9 major and 45 minor types indicating how the presence of a relation is marked (based on the Signaling Corpus scheme)
*	Shallow discourse relations following the PDTB v3 scheme
*	Five abstractive summaries for each document following strict, comparable guidelines across genres

Note on Reddit data: token text is not contained in the release but can be downloaded with an included script.

For more information and to search or download the corpus online, see the corpus website <https://gucorpling.org/gum/>.

Best wishes,

The GUM team

PS – if you like GUM, also check out our automatically annotated AMALGUM <https://github.com/gucorpling/amalgum/> corpus!
