[Rstlist] GUM Corpus V7 - new genres, Wikification and more
    Amir Zeldes 
    Amir.Zeldes at georgetown.edu
       
    Wed Jan 20 14:31:16 UTC 2021
    
    
  
(Apologies for cross-postings)

*** The GUM Corpus - Release 7.0.0 ***
*** Georgetown University Multilayer corpus ***

Corpling at GU <https://corpling.uis.georgetown.edu/corpling/> is happy to announce the first release of series 7 of the Georgetown University Multilayer corpus (GUM V7.0.0):

https://corpling.uis.georgetown.edu/gum/

New in this version:

- 20 documents added from four new genres (total tokens: 150,756):
  - Face-to-face conversation (material from the Santa Barbara Corpus courtesy of John Du Bois, UCSB)
  - Political speeches (public domain data)
  - Open access textbooks from OpenStax
  - YouTube Creative Commons-licensed vlogs
- New Wikification layer covering all named entities, including nested and pronominal mentions (work by Yi-Ju Lin)
- Added function labels to constituent trees
- Added addressee information for speakers in UD data
- Complete overhaul of date/time normalization (work by Nitin Venkateswaran)
- Complete overhaul of entity and coreference annotations, incl. separate annotation of split antecedents (work by Yi-Ju Lin and Amir Zeldes)
- Increased consistency with other UD corpora, incl. new and more comprehensive morphological features
- Many corrections to all annotation layers

GUM is an open source corpus of richly annotated English texts from multiple genres: academic, bio, conversation, fiction, interview, news, speeches, textbooks, travel, vlogs, how-to and Reddit forum discussions. The corpus is created by students as part of the Computational Linguistics curriculum at Georgetown University and is available under Creative Commons licenses.

This is the first version of GUM series 7, containing roughly 150K tokens annotated for:

- Multiple POS tags (100% manual gold PTB, extended PTB, converted CLAWS5 and UPOS) and UD morphological features
- Manually corrected lemmatization
- Sentence segmentation and rough speech act (manual)
- Document structure using TEI tags (paragraphs, headings, figures, captions etc., all manual)
- Constituent and dependency syntax (manually corrected Universal Dependencies, and PTB parses from gold tags with function labels)
- Information status (given, accessible, new, split antecedent)
- Entity and coreference annotation (including non-named entities, singletons, appositions, cataphora and several types of bridging)
- Entity linking (Wikification) of all named entities with Wikipedia articles, including their non-named and pronominal mentions
- Discourse parses in Rhetorical Structure Theory and discourse dependencies

Note on Reddit data: token text is not contained in the release but can be downloaded with an included script.

For more information and to search or download the corpus online, see the corpus website <https://corpling.uis.georgetown.edu/gum/>.

Best wishes,
The GUM team