[Rstlist] GUM Corpus V10 - new genres and annotations

Amir Zeldes Amir.Zeldes at georgetown.edu
Mon Feb 19 16:06:54 UTC 2024


(Apologies for cross-postings)

*** The GUM Corpus - Release 10.0.0 ***

*** Georgetown University Multilayer corpus ***

Corpling at GU <https://gucorpling.org/corpling/> is happy to announce the first release of series 10 of the Georgetown University Multilayer corpus (GUM V10.0.0):

https://gucorpling.org/gum/

New in this version: 

- 4 new genres with 22 new documents (total tokens: 228,399):

  - Courtroom transcripts

  - Essays

  - Letters (on paper, not e-mails)

  - Podcasts

- Expansions to the discourse annotation layer

  - Enhanced RST parses with additional, non-projective tree-breaking relations (multiple relations per node)

  - Complete signaling annotation including discourse markers and other discourse signals following the Signaling Corpus

  - PDTB-style connective annotation and DISRPT-style relation classification data

- Morphological segmentation following UniMorph

- Annotation of select constructions based on Construction Grammar (e.g. resultatives, NPN, causal-excess)

- Many corrections to all annotation layers

GUM is an open source corpus of richly annotated English texts from 16 genres: academic, bio, courtroom, conversation, essay, fiction, interview, letters, news, podcasts, speeches, textbooks, travel, vlogs, how-to and Reddit forum discussions. The corpus is created by students as part of the Computational Linguistics curriculum at Georgetown University and is available under Creative Commons licenses.

This is the first version of GUM series 10, containing roughly 228K tokens annotated for:

- Multiple POS tags (100% manual gold PTB, extended PTB, converted CLAWS5 and UPOS) and UD morphological features

- Manually corrected lemmatization and morphological segmentation

- Sentence segmentation and rough speech act (manual)

- Document structure using TEI tags (paragraphs, headings, figures, captions etc., all manual)

- Constituent and dependency syntax (manually corrected Universal Dependencies, and PTB parses from gold tags with function labels and enhanced dependencies)

- Information status (given-active/inactive, accessible-inferable/common ground/aggregate, and new)

- Entity type, salience and coreference annotation (including non-named entities, singletons, appositions, cataphora and several types of bridging), as well as Centering Theory annotations

- Entity linking (Wikification) of all named entities with Wikipedia articles, including their non-named and pronominal mentions

- Discourse parses in enhanced Rhetorical Structure Theory (eRST) and discourse dependencies

- Discourse signal annotations classified into 9 major and 45 minor types indicating how the presence of a relation is marked (based on the Signaling Corpus scheme)

- Abstractive summaries for each document (two summaries per document in the test set)

Note on Reddit data: token text is not contained in the release but can be downloaded with an included script.

For more information and to search or download the corpus online, see the corpus website <https://gucorpling.org/gum/>.

Best wishes,

The GUM team

PS – if you like GUM, check out our ‘extreme genre test set’ GENTLE <https://github.com/gucorpling/gentle/>, and the larger, automatically annotated AMALGUM <https://github.com/gucorpling/amalgum/> corpus!
