[Rstlist] GUM Corpus V10 - new genres and annotations
Amir Zeldes
Amir.Zeldes at georgetown.edu
Mon Feb 19 16:06:54 UTC 2024
(Apologies for cross-postings)
*** The GUM Corpus - Release 10.0.0 ***
*** Georgetown University Multilayer corpus ***
Corpling at GU <https://gucorpling.org/corpling/> is happy to announce the first release of series 10 of the Georgetown University Multilayer corpus (GUM V10.0.0):
https://gucorpling.org/gum/
New in this version:
- 4 new genres with 22 new documents: (total tokens: 228,399)
  - Courtroom transcripts
  - Essays
  - Letters (on paper, not e-mails)
  - Podcasts
- Expansions to the discourse annotation layer
  - Enhanced RST parses with additional, non-projective tree-breaking relations (multiple relations per node)
  - Complete signaling annotation including discourse markers and other discourse signals following the Signaling Corpus
  - PDTB-style connective annotation and DISRPT style relation classification data
- Morphological segmentation following UniMorph
- Annotation of select constructions based on Construction Grammar (e.g. resultatives, NPN, causal-excess)
- Many corrections to all annotation layers
GUM is an open source corpus of richly annotated English texts from 16 genres: academic, bio, courtroom, conversation, essay, fiction, interview, letters, news, podcasts, speeches, textbooks, travel, vlogs, how-to and Reddit forum discussions. The corpus is created by students as part of the Computational Linguistics curriculum at Georgetown University and is available under Creative Commons licenses.
This is the first version of GUM series 10, containing roughly 228K tokens annotated for:
- Multiple POS tags (100% manual gold PTB, extended PTB, converted CLAWS5 and UPOS) and UD morphological features
- Manually corrected lemmatization and morphological segmentation
- Sentence segmentation and rough speech act labels (manual)
- Document structure using TEI tags (paragraphs, headings, figures, captions etc., all manual)
- Constituent and dependency syntax (manually corrected Universal Dependencies, and PTB parses from gold tags with function labels and enhanced dependencies)
- Information status (given-active/inactive, accessible-inferable/common ground/aggregate, and new)
- Entity type, salience and coreference annotation (including non-named entities, singletons, appositions, cataphora and several types of bridging), as well as Centering Theory annotations
- Entity linking (Wikification) of all named entities with Wikipedia articles, including their non-named and pronominal mentions
- Discourse parses in enhanced Rhetorical Structure Theory (eRST) and discourse dependencies
- Discourse signal annotations classified into 9 major and 45 minor types indicating how the presence of a relation is marked (based on the Signaling Corpus scheme)
- Abstractive summaries for each document (two summaries per document in the test set)
Note on Reddit data: token text is not contained in the release but can be downloaded with an included script.
For more information and to search or download the corpus online, see the corpus website <https://gucorpling.org/gum/> .
Best wishes,
The GUM team
PS – if you like GUM, check out our ‘extreme genre test set’ GENTLE <https://github.com/gucorpling/gentle/> , and the larger, automatically annotated AMALGUM <https://github.com/gucorpling/amalgum/> corpus!