[Rstlist] GUM Corpus V10 - new genres and annotations

Amir Zeldes Amir.Zeldes at georgetown.edu
Mon Feb 19 16:06:54 UTC 2024

(Apologies for cross-postings)


*** The GUM Corpus - Release 10.0.0 ***

*** Georgetown University Multilayer corpus ***


Corpling at GU <https://gucorpling.org/corpling/> is happy to announce the first release of series 10 of the Georgetown University Multilayer corpus (GUM V10.0.0):




New in this version: 


- 4 new genres with 22 new documents (total tokens: 228,399):

  - Courtroom transcripts

  - Essays

  - Letters (on paper, not e-mails)

  - Podcasts

- Expansions to the discourse annotation layer

  - Enhanced RST parses with additional, non-projective tree-breaking relations (multiple relations per node)

  - Complete signaling annotation including discourse markers and other discourse signals following the Signaling Corpus

  - PDTB-style connective annotation and DISRPT-style relation classification data

- Morphological segmentation following UniMorph

- Annotation of select constructions based on Construction Grammar (e.g. resultatives, NPN, causal-excess)

- Many corrections to all annotation layers


GUM is an open source corpus of richly annotated English texts from 16 genres: academic, bio, courtroom, conversation, essay, fiction, interview, letters, news, podcasts, speeches, textbooks, travel, vlogs, how-to and Reddit forum discussions. The corpus is created by students as part of the Computational Linguistics curriculum at Georgetown University and is available under Creative Commons licenses.


This is the first version of GUM series 10, containing roughly 228K tokens annotated for:


- Multiple POS tags (100% manual gold PTB, extended PTB, converted CLAWS5 and UPOS) and UD morphological features

- Manually corrected lemmatization and morphological segmentation

- Sentence segmentation and rough speech act (manual)

- Document structure using TEI tags (paragraphs, headings, figures, captions etc., all manual)

- Constituent and dependency syntax (manually corrected Universal Dependencies, and PTB parses from gold tags with function labels and enhanced dependencies)

- Information status (given-active/inactive, accessible-inferable/common ground/aggregate, and new)

- Entity type, salience and coreference annotation (including non-named entities, singletons, appositions, cataphora and several types of bridging), as well as Centering Theory annotations

- Entity linking (Wikification) of all named entities with Wikipedia articles, including their non-named and pronominal mentions

- Discourse parses in enhanced Rhetorical Structure Theory (eRST) and discourse dependencies

- Discourse signal annotations classified into 9 major and 45 minor types indicating how the presence of a relation is marked (based on the Signaling Corpus scheme)

- Abstractive summaries for each document (two summaries per document in the test set)


Note on Reddit data: token text is not contained in the release but can be downloaded with an included script.


For more information and to search or download the corpus online, see the corpus website <https://gucorpling.org/gum/>.


Best wishes,

The GUM team


PS – if you like GUM, check out our ‘extreme genre test set’ GENTLE <https://github.com/gucorpling/gentle/>, and the larger, automatically annotated AMALGUM <https://github.com/gucorpling/amalgum/> corpus!


