[Rstlist] GUM Corpus V9 - new data and annotations
Amir Zeldes
Amir.Zeldes at georgetown.edu
Thu Feb 2 20:46:50 UTC 2023
(Apologies for cross-postings)
*** The GUM Corpus - Release 9.0.0 ***
*** Georgetown University Multilayer corpus ***
Corpling at GU <https://gucorpling.org/corpling/> is happy to announce the first release of series 9 of the Georgetown University Multilayer corpus (GUM V9.0.0):
https://gucorpling.org/gum/
New in this version:
- 20 new documents added, including more conversational data (total tokens: 203,879)
- Abstractive summaries for each document
- Annotations for salient/non-salient entities in each document
- Foreign language tags to identify individual source languages where relevant
- New, easier process for reconstructing Reddit text data
- Many corrections to all annotation layers
GUM is an open-source corpus of richly annotated English texts from multiple genres: academic, bio, conversation, fiction, interview, news, speeches, textbooks, travel, vlogs, how-to and Reddit forum discussions. The corpus is created by students as part of the Computational Linguistics curriculum at Georgetown University and is available under Creative Commons licenses.
This is the first version of GUM series 9, containing roughly 200K tokens annotated for:
- Multiple POS tags (100% manual gold PTB, extended PTB, converted CLAWS5 and UPOS) and UD morphological features
- Manually corrected lemmatization
- Sentence segmentation and rough speech act (manual)
- Document structure using TEI tags (paragraphs, headings, figures, captions, etc., all manual)
- Constituent and dependency syntax (manually corrected Universal Dependencies, and PTB parses from gold tags with function labels)
- Information status (given-active/inactive, accessible-inferable/common ground/aggregate, and new)
- Entity type, salience and coreference annotation (including non-named entities, singletons, appositions, cataphora and several types of bridging)
- Entity linking (Wikification) of all named entities with Wikipedia articles, including their non-named and pronominal mentions
- Discourse parses in Rhetorical Structure Theory and discourse dependencies
- Abstractive summaries
Note on Reddit data: token text is not contained in the release but can be downloaded with an included script.
For more information and to search or download the corpus online, see the corpus website <https://gucorpling.org/gum/> .
Best wishes,
The GUM team