[Rstlist] GUM Corpus V7 - new genres, Wikification and more
Amir Zeldes
Amir.Zeldes at georgetown.edu
Wed Jan 20 14:31:16 UTC 2021
(Apologies for cross-postings)

*** The GUM Corpus - Release 7.0.0 ***
*** Georgetown University Multilayer corpus ***

Corpling at GU <https://corpling.uis.georgetown.edu/corpling/> is happy to announce the first release of series 7 of the Georgetown University Multilayer corpus (GUM V7.0.0):

https://corpling.uis.georgetown.edu/gum/

New in this version:

- 20 documents added from four new genres (corpus total: 150,756 tokens):
  - Face-to-face conversation (material from the Santa Barbara Corpus, courtesy of John Du Bois, UCSB)
  - Political speeches (public domain data)
  - Open access textbooks from OpenStax
  - Creative Commons-licensed YouTube vlogs
- New Wikification layer covering all named entities, including nested and pronominal mentions (work by Yi-Ju Lin)
- Added function labels to constituent trees
- Added addressee information for speakers in UD data
- Complete overhaul of date/time normalization (work by Nitin Venkateswaran)
- Complete overhaul of entity and coreference annotations, incl. separate annotation of split antecedents (work by Yi-Ju Lin and Amir Zeldes)
- Increased consistency with other UD corpora, incl. new and more comprehensive morphological features
- Many corrections to all annotation layers

GUM is an open source corpus of richly annotated English texts from multiple genres: academic, bio, conversation, fiction, interview, news, speeches, textbooks, travel, vlogs, how-to and Reddit forum discussions. The corpus is created by students as part of the Computational Linguistics curriculum at Georgetown University and is available under Creative Commons licenses.

This is the first version of GUM series 7, containing roughly 150K tokens annotated for:

- Multiple POS tags (100% manual gold PTB, extended PTB, converted CLAWS5 and UPOS) and UD morphological features
- Manually corrected lemmatization
- Sentence segmentation and rough speech act (manual)
- Document structure using TEI tags (paragraphs, headings, figures, captions etc., all manual)
- Constituent and dependency syntax (manually corrected Universal Dependencies, and PTB parses from gold tags with function labels)
- Information status (given, accessible, new, split antecedent)
- Entity and coreference annotation (including non-named entities, singletons, appositions, cataphora and several types of bridging)
- Entity linking (Wikification) of all named entities with Wikipedia articles, including their non-named and pronominal mentions
- Discourse parses in Rhetorical Structure Theory and discourse dependencies

Note on Reddit data: token text is not contained in the release but can be downloaded with an included script.

For more information and to search or download the corpus online, see the corpus website <https://corpling.uis.georgetown.edu/gum/>.

Best wishes,
The GUM team