[Rstlist] GUM Corpus V7 - new genres, Wikification and more

Wed Jan 20 14:31:16 UTC 2021

(Apologies for cross-postings)

*** The GUM Corpus - Release 7.0.0 ***

*** Georgetown University Multilayer corpus ***

Corpling at GU <https://corpling.uis.georgetown.edu/corpling/> is happy to announce the first release of series 7 of the Georgetown University Multilayer corpus (GUM V7.0.0):

https://corpling.uis.georgetown.edu/gum/ 

New in this version: 

- 20 documents added from four new genres (total tokens: 150,756):

  - Face-to-face conversation (material from the Santa Barbara Corpus courtesy of John Du Bois, UCSB)

  - Political speeches (public domain data)

  - Open-access textbooks from OpenStax

  - YouTube Creative Commons-licensed vlogs

- New Wikification layer covering all named entities, including nested and pronominal mentions (work by Yi-Ju Lin)

- Added function labels to constituent trees

- Added addressee information for speakers in UD data

- Complete overhaul of date/time normalization (work by Nitin Venkateswaran)

- Complete overhaul of entity and coreference annotations, incl. separate annotation of split antecedents (work by Yi-Ju Lin and Amir Zeldes)

- Increased consistency with other UD corpora, incl. new and more comprehensive morphological features

- Many corrections to all annotation layers

GUM is an open source corpus of richly annotated English texts from multiple genres: academic, bio, conversation, fiction, interview, news, speeches, textbooks, travel, vlogs, how-to and Reddit forum discussions. The corpus is created by students as part of the Computational Linguistics curriculum at Georgetown University and is available under Creative Commons licenses.

This is the first version of GUM series 7, containing roughly 150K tokens annotated for:

- Multiple POS tags (100% manual gold PTB, extended PTB, converted CLAWS5 and UPOS) and UD morphological features

- Manually corrected lemmatization

- Sentence segmentation and rough speech act (manual)

- Document structure using TEI tags (paragraphs, headings, figures, captions etc., all manual)

- Constituent and dependency syntax (manually corrected Universal Dependencies, and PTB parses from gold tags with function labels)

- Information status (given, accessible, new, split antecedent)

- Entity and coreference annotation (including non-named entities, singletons, appositions, cataphora and several types of bridging)

- Entity linking (Wikification) of all named entities with Wikipedia articles, including their non-named and pronominal mentions

- Discourse parses in Rhetorical Structure Theory and discourse dependencies

Note on Reddit data: token text is not contained in the release but can be downloaded with an included script.

For more information and to search or download the corpus online, see the corpus website <https://corpling.uis.georgetown.edu/gum/>.

Best wishes,

The GUM team
