[Rstlist] GUM Corpus V10 - new genres and annotations
    Amir Zeldes 
    Amir.Zeldes at georgetown.edu
       
    Mon Feb 19 16:06:54 UTC 2024
    
    
  
(Apologies for cross-postings)

*** The GUM Corpus - Release 10.0.0 ***
*** Georgetown University Multilayer corpus ***

Corpling at GU <https://gucorpling.org/corpling/> is happy to announce the first release of series 10 of the Georgetown University Multilayer corpus (GUM V10.0.0):

https://gucorpling.org/gum/

New in this version:

- 4 new genres with 22 new documents (total tokens: 228,399):
  - Courtroom transcripts
  - Essays
  - Letters (on paper, not e-mails)
  - Podcasts
- Expansions to the discourse annotation layer
  - Enhanced RST parses with additional, non-projective tree-breaking relations (multiple relations per node)
  - Complete signaling annotation including discourse markers and other discourse signals following the Signaling Corpus
  - PDTB-style connective annotation and DISRPT-style relation classification data
- Morphological segmentation following UniMorph
- Annotation of select constructions based on Construction Grammar (e.g. resultatives, NPN, causal-excess)
- Many corrections to all annotation layers

GUM is an open source corpus of richly annotated English texts from 16 genres: academic, bio, courtroom, conversation, essay, fiction, interview, letters, news, podcasts, speeches, textbooks, travel, vlogs, how-to and Reddit forum discussions. The corpus is created by students as part of the Computational Linguistics curriculum at Georgetown University and is available under Creative Commons licenses.

This is the first version of GUM series 10, containing roughly 228K tokens annotated for:

- Multiple POS tags (100% manual gold PTB, extended PTB, converted CLAWS5 and UPOS) and UD morphological features
- Manually corrected lemmatization and morphological segmentation
- Sentence segmentation and rough speech act (manual)
- Document structure using TEI tags (paragraphs, headings, figures, captions etc., all manual)
- Constituent and dependency syntax (manually corrected Universal Dependencies, and PTB parses from gold tags with function labels and enhanced dependencies)
- Information status (given-active/inactive, accessible-inferable/common ground/aggregate, and new)
- Entity type, salience and coreference annotation (including non-named entities, singletons, appositions, cataphora and several types of bridging), as well as Centering Theory annotations
- Entity linking (Wikification) of all named entities with Wikipedia articles, including their non-named and pronominal mentions
- Discourse parses in enhanced Rhetorical Structure Theory (eRST) and discourse dependencies
- Discourse signal annotations classified into 9 major and 45 minor types indicating how the presence of a relation is marked (based on the Signaling Corpus scheme)
- Abstractive summaries for each document (two summaries per document in the test set)

Note on Reddit data: token text is not contained in the release but can be downloaded with an included script.

For more information and to search or download the corpus online, see the corpus website <https://gucorpling.org/gum/>.

Best wishes,
The GUM team

PS – if you like GUM, check out our ‘extreme genre test set’ GENTLE <https://github.com/gucorpling/gentle/>, and the larger, automatically annotated AMALGUM <https://github.com/gucorpling/amalgum/> corpus!