From Amir.Zeldes at georgetown.edu  Thu Mar 13 20:36:27 2025
From: Amir.Zeldes at georgetown.edu (Amir.Zeldes at georgetown.edu)
Date: Thu, 13 Mar 2025 16:36:27 -0400
Subject: [Rstlist] GUM Corpus V11 - new documents and annotations
Message-ID: <008b01db9457$98447350$c8cd59f0$@georgetown.edu>

(Apologies for cross-postings)

*** The GUM Corpus - Release 11.0.0 ***
*** Georgetown University Multilayer corpus ***

Corpling at GU is happy to announce the first release of series 11 of the Georgetown University Multilayer corpus (GUM V11.0.0):

https://gucorpling.org/gum/

New in this version:

* GUM and the out-of-domain test set GENTLE have now merged!
* New documents: the corpus now contains 268,208 tokens
* Five different summaries per document
* Graded salience scores (0-5) for each entity in every document

GUM is an open source corpus of richly annotated English texts from 24 genres:

* Main genres (available in train/dev/test):
  * academic writing
  * biographies
  * courtroom transcripts
  * essays
  * fiction
  * how-to guides
  * interviews
  * letters
  * news
  * online forum discussions
  * podcasts
  * political speeches
  * spontaneous face to face conversations
  * textbooks
  * travel guides
  * vlogs

* Out-of-domain test genres (test2, aka GENTLE partition):
  * dictionary entries
  * live esports commentary
  * legal documents
  * medical notes
  * poetry
  * mathematical proofs
  * course syllabuses
  * threat letters

The corpus is created by students as part of the Computational Linguistics curriculum at Georgetown University and is available under Creative Commons licenses.

This is the first version of GUM series 11, containing roughly 281 documents annotated for:

* Multiple POS tags (100% manual gold PTB, extended PTB, converted CLAWS5 and UPOS) and UD morphological features
* Manually corrected lemmatization and morphological segmentation
* Sentence segmentation and rough speech act (manual)
* Document structure using TEI tags (paragraphs, headings, figures, captions etc., all manual)
* Constituent and dependency syntax (manually corrected Universal Dependencies, and PTB parses from gold tags with function labels and enhanced dependencies)
* Construction Grammar annotations following UCxn
* Information status (given-active/inactive, accessible-inferable/common ground/aggregate, and new)
* Entity type, graded salience (0-5) and coreference annotation (including non-named entities, singletons, appositions, cataphora and several types of bridging), as well as Centering Theory annotations
* Entity linking (Wikification) of all named entities with Wikipedia articles, including their non-named and pronominal mentions
* Discourse parses in enhanced Rhetorical Structure Theory (eRST) and discourse dependencies, including multiple concurrent and non-projective relations
* Discourse signal annotations classified into 9 major and 45 minor types indicating how the presence of a relation is marked (based on the Signaling Corpus scheme)
* Shallow discourse relations following the PDTB v3 scheme
* Five abstractive summaries for each document following strict, comparable guidelines across genres

Note on Reddit data: token text is not contained in the release but can be downloaded with an included script.

For more information and to search or download the corpus online, see the corpus website: https://gucorpling.org/gum/

Best wishes,
The GUM team

PS: if you like GUM, also check out our automatically annotated AMALGUM corpus!