30.2259, Software: Second Release of the Phrase Detectives Corpus of Anaphoric Information

The LINGUIST List linguist at listserv.linguistlist.org
Thu May 30 21:57:44 UTC 2019


LINGUIST List: Vol-30-2259. Thu May 30 2019. ISSN: 1069 - 4875.

Subject: 30.2259, Software: Second Release of the Phrase Detectives Corpus of Anaphoric Information

Moderator: Malgorzata E. Cavar (linguist at linguistlist.org)
Student Moderator: Jeremy Coburn
Managing Editor: Becca Morris
Team: Helen Aristar-Dry, Everett Green, Sarah Robinson, Peace Han, Nils Hjortnaes, Yiwen Zhang, Julian Dietrich
Jobs: jobs at linguistlist.org | Conferences: callconf at linguistlist.org | Pubs: pubs at linguistlist.org

Homepage: http://linguistlist.org

Editor for this issue: Everett Green <everett at linguistlist.org>
================================================================


Date: Thu, 30 May 2019 17:56:21
From: Massimo Poesio [m.poesio at qmul.ac.uk]
Subject: Second Release of the Phrase Detectives Corpus of Anaphoric Information

 
The DALI and NIEUW projects would like to announce the second release of the
Phrase Detectives corpus (Poesio et al., 2019), a corpus of documents annotated
with anaphoric (coreference) information collected using the Phrase Detectives
Game-With-A-Purpose

http://www.phrasedetectives.org

as part of an ongoing collaboration between Queen Mary University, the
University of Essex, the Linguistic Data Consortium, and other partners to
collect linguistic resources via games-with-a-purpose and citizen science:

https://lingoboingo.org/

This new version of the corpus includes significantly more data than the first
version, released in 2016. For each markable it supplies both a substantial
number of judgments expressed by the players and a silver label calculated
using the probabilistic aggregation method for anaphoric information proposed
by Paun et al. (2018). To our knowledge, this release is the most extensive
collection of multiple judgments for anaphora resolution and one of the
largest in NLP.

The release consists of the documents whose annotation was completed as of
12 October 2018, i.e., those for which 8 judgments per markable and 4
validations per interpretation had been collected. In total, the current
release consists of 542 documents, for a total of 408K tokens and 108K
markables. The corpus is divided into two parts: silver and gold. The silver
part consists of 497 documents, for a total of 384K tokens and 101K markables.
The release also includes the 45 documents of the first release, which were
gold-annotated by two expert annotators. We refer to this subset, for which
both gold and silver annotations are available, as gold, as it is intended to
be used as a test set. The gold subset consists of 45 documents, for a total
of 23K tokens and 6K markables.

In total, 2,235,664 judgments were collected from 1,958 players about the
markables in the collection, of which 1,358,559 were annotations and 867,844
were validations. On average, 20.6 judgments were collected per markable:
12.6 annotations and 8 validations.
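
As a rough check, the per-markable averages can be reproduced from the two
component counts. The sketch below assumes the rounded total of 108K
markables, so the results are approximate; the function and variable names
are illustrative only.

```python
# Counts as reported in the announcement. The markable total is the
# rounded figure (108K), so the averages below are approximate.
ANNOTATIONS = 1_358_559
VALIDATIONS = 867_844
MARKABLES = 108_000  # rounded, assumption for this sketch


def per_markable(count: int, markables: int = MARKABLES) -> float:
    """Average number of judgments of one kind per markable."""
    return count / markables


avg_annotations = per_markable(ANNOTATIONS)
avg_validations = per_markable(VALIDATIONS)
avg_judgments = avg_annotations + avg_validations

print(f"{avg_annotations:.1f} annotations + {avg_validations:.1f} validations "
      f"= {avg_judgments:.1f} judgments per markable")
# prints: 12.6 annotations + 8.0 validations = 20.6 judgments per markable
```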

The annotation scheme of the Phrase Detectives corpus is a simplified form of
the coding scheme used in the ARRAU corpus (Uryupina et al., 2019), also
distributed by the LDC. Players were asked to classify markables as referring
or non-referring. Referring noun phrases could be classified as either
discourse-new or discourse-old (referring to the same entity as a previous
mention). Two types of non-referring expressions are identified: expletives
and predicative NPs (called ‘properties’). Discourse-old markables include
so-called split-antecedent plurals (as in "Mary met John. They had dinner
together."). The key property of this dataset is that all judgments expressed
by players are stored: 20 judgments per markable on average, and up to 90 in
one case. In addition, a silver label extracted from the judgments using the
MPA probabilistic annotation method (Paun et al., 2018) is provided.
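
To make the scheme concrete, the four categories and a per-player judgment
could be modelled roughly as follows. This is a minimal sketch: the class and
field names are hypothetical and do not reflect the corpus file layout, and
the aggregation shown is plain majority voting, used here only as a simple
stand-in. It is not the MPA model of Paun et al. (2018).

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Category(Enum):
    DISCOURSE_NEW = "discourse-new"   # referring, first mention
    DISCOURSE_OLD = "discourse-old"   # referring, has antecedent(s)
    EXPLETIVE = "expletive"           # non-referring
    PROPERTY = "property"             # non-referring (predicative NP)


@dataclass(frozen=True)
class Judgment:
    """One player's interpretation of a markable (hypothetical layout)."""
    player_id: str
    category: Category
    # Antecedent markable IDs; only meaningful for DISCOURSE_OLD.
    # More than one ID models a split-antecedent plural.
    antecedents: frozenset = frozenset()


def majority_label(judgments):
    """Return the most frequent (category, antecedents) pair, or None.

    Plain majority voting, NOT the MPA model. Ties are broken in favour
    of the interpretation that was seen first.
    """
    if not judgments:
        return None
    counts = Counter((j.category, j.antecedents) for j in judgments)
    return counts.most_common(1)[0][0]


# Judgments for "They" in "Mary met John. They had dinner together."
they = [
    Judgment("p1", Category.DISCOURSE_OLD, frozenset({"Mary", "John"})),
    Judgment("p2", Category.DISCOURSE_OLD, frozenset({"Mary", "John"})),
    Judgment("p3", Category.DISCOURSE_OLD, frozenset({"John"})),
    Judgment("p4", Category.DISCOURSE_NEW),
]
category, antecedents = majority_label(they)
print(category.value, sorted(antecedents))
# prints: discourse-old ['John', 'Mary']
```

Keeping every judgment rather than only the aggregated label is what allows
other aggregation methods, or models of disagreement, to be applied later.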

The dataset will be made available from the LDC over the summer, and can
already be downloaded from the DALI project GitHub repository:

https://github.com/dali-ambiguity/Phrase-Detectives-Corpus-2.1.4

The data are released in three formats: the original MAS-XML format already
used in the first release, a CoNLL-style format, and the CRAC 2018 format
(silver label only).
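
For readers unfamiliar with CoNLL-style coreference files, the sketch below
shows how mention spans are commonly recovered from such a file. It assumes
whitespace-separated columns whose LAST column uses CoNLL-2012-style
bracketing ("(3", "3)", "(3)"), with "-" for tokens outside any mention. The
actual column layout of this corpus is not described here and should be
checked against the repository documentation; the function name and sample
data are illustrative only.

```python
import re
from collections import defaultdict


def read_mentions(lines):
    """Collect (start, end) token spans per coreference ID.

    Assumption: the last column holds CoNLL-2012-style bracketing.
    Token indices run over the whole input; blank lines and lines
    starting with "#" are skipped.
    """
    open_starts = defaultdict(list)  # ID -> stack of open start indices
    chains = defaultdict(list)       # ID -> list of (start, end) spans
    token_index = 0
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        coref = line.split()[-1]
        for opening, ident, closing in re.findall(r"(\(?)(\d+)(\)?)", coref):
            if opening:
                open_starts[ident].append(token_index)
            if closing and open_starts[ident]:
                start = open_starts[ident].pop()
                chains[ident].append((start, token_index))
        token_index += 1
    return dict(chains)


# Made-up sample in the assumed layout: index, token, coreference column.
sample = """\
0 Mary (1)
1 met -
2 the (2
3 old -
4 man 2)
5 . -

0 She (1)
1 smiled -
2 . -
"""

print(read_mentions(sample.splitlines()))
# prints: {'1': [(0, 0), (6, 6)], '2': [(2, 4)]}
```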

References:

Paun et al. 2018: Silviu Paun, Jon Chamberlain, Udo Kruschwitz, Juntao Yu and
Massimo Poesio, 2018. A Probabilistic Annotation Model for Crowdsourcing
Coreference. Proc. of EMNLP.
https://www.aclweb.org/anthology/papers/D/D18/D18-1218/

Poesio et al. 2019: Massimo Poesio, Jon Chamberlain, Silviu Paun, Juntao Yu,
Alexandra Uma and Udo Kruschwitz, 2019. A Crowdsourced Corpus of Multiple
Judgments and Disagreement on Anaphoric Interpretation. Proc. of NAACL.


Linguistic Field(s): Computational Linguistics


