[Rstlist] GUM Corpus 6.0.0

Amir Zeldes Amir.Zeldes at georgetown.edu
Mon Mar 9 14:14:36 EDT 2020

(Apologies for cross-postings)


*** The GUM Corpus - Release 6.0.0 ***

*** Georgetown University Multilayer corpus ***


The Corpling Lab at Georgetown University
<http://corpling.uis.georgetown.edu/corpling/> is happy to announce the
first release of series 6 of the Georgetown University Multilayer corpus.


New in this version: 

-        22 documents added (total tokens: 129,660)

-        Discourse parses in Rhetorical Structure Theory now follow RST-DT

-        5 new relations (means, manner, attribution, question and

-        Discourse dependency representation and lisp-style formats

-        Now using native Universal Dependencies syntax trees (not automatic

-        Many manual corrections to lemmatization, POS and other consistency


GUM is an open source corpus of richly annotated English texts from multiple
genres: academic, bio, fiction, interview, news, travel, how-to and Reddit
forum discussions. The corpus is created by students as part of the
Computational Linguistics curriculum at Georgetown University and is
available under Creative Commons licenses.


This is the first version of GUM series 6, containing nearly 130K tokens
annotated for:


-        Multiple POS tags (100% manual gold PTB, extended PTB, converted
CLAWS5 and UPOS) and UD morphological features

-        Manually corrected lemmatization

-        Sentence segmentation and rough speech act (manual)

-        Document structure using TEI tags (paragraphs, headings, figures,
captions etc., all manual)

-        Constituent and dependency syntax (manually corrected Universal
Dependencies, and PTB parses from gold tags)

-        Information status (given, accessible, new)

-        Entity and coreference annotation (including non-named entities,
singletons, appositions, cataphora and several types of bridging)

-        Discourse parses in Rhetorical Structure Theory


Note on Reddit data: token text is not contained in the release but can be
downloaded with an included script.


For more information and to search or download the corpus online, see:

Dr. Amir Zeldes

Assoc. Prof. of Computational Linguistics

Department of Linguistics

Georgetown University

1437 37th St. NW

Washington, DC 20057
