[Corpora-List] Referring Expression Generation Challenge 2008: First Call for Participation

Anja Belz asb at brighton.ac.uk
Fri Jan 4 14:22:27 UTC 2008


*** Apologies, the correct URL for the Challenge homepage is actually:
http://www.itri.brighton.ac.uk/research/reg08 ***


FIRST CALL FOR PARTICIPATION

REFERRING EXPRESSION GENERATION CHALLENGE 2008
----------------------------------------------

To be held in conjunction with the 5th International Natural Language
Generation Conference (INLG 2008), June 12-14, 2008 Salt Fork, Ohio,
USA

Following the success of the Pilot NLG Challenge on Attribute
Selection for Generating Referring Expressions (ASGRE) in September
2007, we are organising a second NLG Challenge, the Referring
Expression Generation Challenge (REG 2008), to be presented and
discussed during a special session at INLG 2008.  While the ASGRE
Challenge focused on attribute selection for definite references, the
REG Challenge expands the scope of the original to include both
attribute selection and realisation, and introduces a new task
involving references to named entities in context.

The REG Challenge has eight submission tracks and two different data
sets.  It maintains the ASGRE Challenge's emphasis on openness to
alternative task definitions and evaluation methods, and will involve
both automatic and task-based evaluation.

Contents of this Call:

1. Background
2. Generation of Referring Expressions
3. REG Challenge Data Sets
4. REG Challenge Tasks and Submission Tracks
5. Evaluation
6. Participation
7. Proceedings and Presentations
8. Important Dates
9. Organisation


1. Background
-------------

Over the past few years, the need for comparative, quantitatively
evaluated results has been increasingly felt in the field of
NLG. Following a number of discussion sessions at NLG meetings, a
workshop dedicated to the topic was held with NSF support in
Arlington, Va., US, in April 2007 (see
http://www.ling.ohio-state.edu/~mwhite/nlgeval07/).  At this workshop,
a decision was taken to organise a pilot shared task evaluation
challenge, focussing on the area of GRE (Generation of Referring
Expressions) because of the broad consensus
that has arisen among researchers on the nature and scope of this
problem.  The First NLG Challenge on Attribute Selection for
Generating Referring Expressions (ASGRE) was held in Copenhagen in
September 2007 in conjunction with the UCNLG+MT Workshop.  It was a
successful pilot, both in terms of participation and in the variety
and quality of submissions received (see
http://www.csd.abdn.ac.uk/~agatt/home/pubs/asgre2007.pdf).  With 18
initial registrations, and final submissions from six teams comprising
13 researchers and outputs from 22 different systems,
community interest was substantial.

Several aspects of the ASGRE Challenge were intended to promote an
approach to shared-task evaluation where community interests feed
directly into the nature and evaluation of tasks.  This is important
in order to counteract a potential narrowing of scope, where a shared
task, rather than reflecting community interest, plays a causal role
in shaping those interests.  The most important of these aspects were:

  * A wide range of evaluation criteria, involving both automatic and
    task-based, intrinsic and extrinsic methods.

  * An Open Category Track which enabled researchers to submit reports
    describing novel approaches involving the shared dataset, while
    opting out of the competitive element.

  * An Evaluation Methods Track for submissions with novel proposals for
    evaluation of the shared task.

  * Self-evaluation: participants computed scores for the development data
    set, using code supplied by the organisers.


2. Generation of Referring Expressions (GRE)
--------------------------------------------

Since the foundational work of authors such as Appelt, Kronfeld,
Grosz, Joshi, Dale and Reiter, GRE has been the subject of intensive
research in the NLG community, giving rise to significant consensus on
the GRE problem definition, as well as the nature of the inputs and
outputs of GRE algorithms.  This is particularly true of the subtask
of attribute selection for definite referring expressions (REs),
perhaps the most widely researched NLG subtask.  A succinct definition
of the attribute selection task is given by Bohnet and Dale (2005):

"Given a symbol corresponding to an intended referent, how do we work
out the semantic content of a referring expression that uniquely
identifies the entity in question?"

This was precisely the task definition in the ASGRE Challenge.  The
REG Challenge adds tasks on realisation of definite REs and choice of
type of RE in discourse context, aiming to cover a larger subset of NLG
research interests.
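
To make the attribute selection task more concrete, the following is a
minimal, purely illustrative sketch in the spirit of Dale and Reiter's
Incremental Algorithm; the attribute names, preference order and data
structures are invented for the example and are not the TUNA input
format or any participating system.

  # Illustrative sketch only: a simplified incremental attribute
  # selection algorithm in the style of Dale & Reiter (1995).  The
  # entity representations below are invented and are not the TUNA
  # annotation scheme.

  def select_attributes(target, distractors, preference_order):
      """Add attributes of `target` until no distractor shares them all."""
      selected = {}
      remaining = list(distractors)
      for attr in preference_order:
          if attr not in target:
              continue
          value = target[attr]
          ruled_out = [d for d in remaining if d.get(attr) != value]
          if ruled_out:
              # This attribute rules out at least one distractor: keep it.
              selected[attr] = value
              remaining = [d for d in remaining if d.get(attr) == value]
          if not remaining:
              break
      return selected

  target = {"type": "fan", "colour": "red", "size": "large"}
  distractors = [
      {"type": "fan", "colour": "blue", "size": "large"},
      {"type": "desk", "colour": "red", "size": "small"},
  ]
  print(select_attributes(target, distractors, ["type", "colour", "size"]))
  # -> {'type': 'fan', 'colour': 'red'}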


3. REG Challenge Data Sets
--------------------------

TUNA Corpus of Referring Expressions:
- - - - - - - - - - - - - - - - - - -

The TUNA Corpus consists of a set of human-produced descriptions of
objects in a visual domain of pictures of furniture or people,
annotated at the semantic level with a domain representation.  It was
collected during an elicitation experiment, in which one
between-subjects condition controlled the use of the location of an
object in descriptions (+/-Location).  The version of the TUNA
data to be used in the REG Challenge will, in addition to attribute
sets, include human-produced descriptions, the corresponding pictures
of domain objects, and the applicable experimental condition.

While ASGRE participants were never shown the test set outputs, we
feel it is more appropriate to use a new test set with unseen inputs
as well as unseen outputs.  We are therefore creating 50 new corpus
items for each subdomain (people and furniture), in experimental
conditions that replicate the original TUNA elicitation experiments.
This time we are obtaining three descriptions for each test item,
which will allow us to compare peer outputs to several human-produced
descriptions, resulting in a more reliable assessment of
humanlikeness.

The TUNA Corpus is described in greater detail here: Gatt, A., van der
Sluis, I., and van Deemter, K. (2007). Evaluating algorithms for the
generation of referring expressions using a balanced corpus.
Proceedings of the 11th European Workshop on Natural Language
Generation, ENLG-07.

See also http://www.csd.abdn.ac.uk/research/evaluation for details of
the corpus.


GREC Corpus of Named Entity References in Context:
- - - - - - - - - - - - - - - - - - - - - - - - -

The GREC corpus consists of just over 2,000 short introductory texts
from Wikipedia entries containing about 18,000 annotated referring
expressions in total.  The annotated references in each text are to a
single entity which constitutes the main subject or topic of the text.
The texts fall into five domains: cities, countries, rivers, people
and mountains.  A subset of the corpus (100 texts), which will serve
as the test set for the tasks involving GREC, contains three
additional referring expressions for each reference in the original
texts, as selected by human participants during an experiment.

We are preparing a second test set, also containing multiple
human-selected alternatives for each reference.  The domain of this
test set will be different from the five contained in the corpus, and
will not be revealed to participants in advance.

Our use of this corpus represents an effort to extend the scope of
the REG Challenge to take into account the effect of discourse context
on the form that a referring expression should take.

An earlier version of the GREC corpus (containing just over 1,000
texts) is described in greater detail here: Belz, A. and Varges,
S. (2007).  Generation of Repeated References to Discourse
Entities.  Proceedings of the 11th European Workshop on Natural
Language Generation, ENLG-07.


4. REG Challenge Tasks and Submission Tracks
--------------------------------------------

Summary of submission tracks:

1. Task 1 (TUNA-AS): Attribute selection for referring expressions.
2. Task 2 (TUNA-R): Realisation of referring expressions.
3. Task 3 (TUNA-REG): Attribute selection and realisation combined.
4. TUNA Open Track: Any work involving the TUNA data.
5. TUNA Evaluation Methods: Any work involving evaluation of Tasks 1-3.
6. Task 4 (GREC): Named Entity generation: given a referent, a
   discourse context and a list of possible referring expressions, select
   the referring expression most appropriate in the context.
7. GREC Open Track: Any work involving the GREC data.
8. GREC Evaluation Methods: Any work involving evaluation of Task 4.

Open Tracks and Evaluation Methods Tracks:

The open tracks act to prevent an overly narrow task definition from
dominating an otherwise varied research field, and the evaluation
methods tracks allow researchers to contribute evaluation methods that
they consider most appropriate.  The idea is that such alternative
tasks and evaluation methods will become part of future evaluation
events (e.g. the MASI metric included this year was proposed by
Advaith Siddharthan, Cambridge University, last year).

Task 1 (TUNA-AS):

This is the same task as in the ASGRE Challenge (i.e. mapping from
domain representations to attribute sets), but with a previously
unseen test data set which will have multiple reference descriptions.
The inclusion of this task will allow participants to try to improve
on the 2007 systems (something called a `Progress Test' in the
NIST-led MT evaluations), and will allow some researchers who were
unable to take part in 2007 to participate.

See also http://www.csd.abdn.ac.uk/research/evaluation for details of
the data and task definition used in the ASGRE Challenge.

Task 2 (TUNA-R):

In Task 2, participants need to create systems that map sets of
attribute-value pairs to natural language descriptions.  For example,
{ type:fan, colour:red, size:large } could map to "the large red fan"
(among other possibilities).  Participants can choose to either
maximise similarity with the human-produced descriptions in the
corpus, or to maximise optimality from the point of view of human
comprehension and fast identification of referents (see also
evaluation section below).
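
As an illustration of the kind of mapping Task 2 asks for, the sketch
below realises an attribute set with a single hand-written template.
The attribute names and the English word-order pattern are assumptions
chosen to match the example above; they are not the official TUNA
attribute inventory, nor the template realiser distributed to
participants.

  # Illustrative sketch only: a naive template-based realiser that turns
  # an attribute set such as {type: fan, colour: red, size: large} into
  # a definite description.

  def realise(attributes):
      """Build "the <size> <colour> <type>" from the attributes present."""
      modifiers = [attributes[a] for a in ("size", "colour") if a in attributes]
      head = attributes.get("type", "object")
      return "the " + " ".join(modifiers + [head])

  print(realise({"type": "fan", "colour": "red", "size": "large"}))
  # -> "the large red fan"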

Task 3 (TUNA-REG):

Task 3 combines Tasks 1 and 2, i.e. participating systems need to map
from a domain representation to a natural language description that
describes the target referent.  Again, participants can choose which
of the evaluation criteria to optimise for.

One important aspect of the TUNA tasks is that the template realiser
and some of the participant attribute-selection systems from the ASGRE
Challenge will be made available to REG Challenge participants.  Thus,
teams participating in these tasks can choose to focus on just one of
the components (mapping from input domains to attribute sets or
mapping from attribute sets to realisations).  This is especially
relevant to participants in Task 3, where a team might choose to
invest more effort on one of the two components, while using an
off-the-shelf system for the other.  Such reuse of components has
proved useful in other evaluation initiatives, such as the TC-STAR
text-to-speech evaluation.

Task 4 (GREC):

In the shared-task version of the GREC corpus, every main subject
reference (MSR) is annotated with a set of possible MSRs.  This set is
automatically generated by collecting all MSRs that occur in the same
text, and applying some generic rules that add a range of default
options.  The set of possible MSRs along with the surrounding text
(including annotations) forms the input, and a single MSR selected for
a given slot forms the output.  Participating systems need to
implement this mapping.  This is a (simplified) example of an input,
where the reference output in the corpus is "Brunei":

  <TEXT ID="1">
    <PARAGRAPH>
        ...
        <REF ID="1.2" TYPE="np-subj" CAT="country" ALT="Brunei; Brunei
         itself _; it; it itself; that; that itself; the country; the
         country itself; which; which itself">Brunei</REF>, the
        remnant of a very powerful sultanate, became independent from
        Great Britain in 1984.
    </PARAGRAPH>
    <PARAGRAPH>
        ...
    </PARAGRAPH>
  </TEXT>

The GREC task is to decide, given a set of alternative referring
expressions, as well as the context of the reference, which of the
alternatives is the most appropriate.  The main focus in the
Humanlikeness evaluation of this task (see following section) will be
on the relatively coarse-grained choice between (a) common-noun
references (e.g. "The country"); (b) proper-name references
(e.g. "Brunei"); and (c) pronominal references (e.g. "it").  Under this
conception, a system is considered to have made a correct choice
relative to a human if it selects a reference of the same type as the
human.  Secondarily, a more fine-grained evaluation will be carried
out, in which a system's actual REs (rather than their types) will be
assessed.
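
By way of illustration, the sketch below reads the candidate referring
expressions out of a REF element like the one above (assuming the ALT
attribute is a semicolon-separated list, as in the example) and applies
the coarse-grained three-way type distinction used in the
humanlikeness evaluation.  The final selection heuristic is an
invented placeholder, not a proposed baseline.

  # Illustrative sketch only: parsing a GREC-style REF element and
  # classifying each candidate RE into the coarse-grained types used in
  # the evaluation (pronoun / common-noun / proper-name reference).
  import xml.etree.ElementTree as ET

  PRONOUNS = {"it", "he", "she", "they", "that", "which"}

  def classify(re_string):
      """Coarse-grained RE type: 'pronoun', 'common' or 'proper'."""
      first_word = re_string.strip().lower().split()[0]
      if first_word in PRONOUNS:
          return "pronoun"
      if first_word == "the":
          return "common"
      return "proper"

  ref = ET.fromstring(
      '<REF ID="1.2" TYPE="np-subj" CAT="country" '
      'ALT="Brunei; it; the country; which">Brunei</REF>'
  )
  candidates = [alt.strip() for alt in ref.get("ALT").split(";")]

  # Invented placeholder heuristic: prefer a proper name for a first
  # mention, a pronoun otherwise.
  first_mention = True
  wanted = "proper" if first_mention else "pronoun"
  print(next((c for c in candidates if classify(c) == wanted), candidates[0]))
  # -> "Brunei"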


5. Evaluation
-------------

All data sets will be divided into training, development and test
data.  Participants will compute evaluation scores on the development
set (using code provided by us), and the organisers will perform
evaluations on the test data set.  We will again use a range of
different evaluation methods, including intrinsic and extrinsic,
automatically assessed and human-evaluated, as shown in the overview
below.  Intrinsic evaluations assess properties of peer systems in
their own right, whereas extrinsic evaluations assess the effect of a
peer system on something that is external to it, such as its effect on
human performance at a given task or the added value it brings to an
application.

Task(s):   Criteria:               Type of evaluation:  Evaluation Methods:

TUNA-AS    Humanlikeness           Intrinsic/automatic  Accuracy, Dice, MASI
           Minimality, Uniqueness  Intrinsic/automatic  Proportion of minimal/unique outputs

TUNA-R,    Humanlikeness           Intrinsic/automatic  Accuracy, BLEU, NIST, string-edit distance
TUNA-REG   Ease of comprehension   Extrinsic/human      Self-paced reading in identification experiment
           Referential Clarity     Extrinsic/human      Speed and accuracy in identification experiment

GREC       Humanlikeness           Intrinsic/automatic  Accuracy, BLEU, NIST, string-edit distance
           Ease of comprehension   Extrinsic/human      Self-paced reading in identification experiment
           Referential Clarity     Extrinsic/human      Speed and accuracy in identification experiment
                                   Intrinsic/human      Direct human assessment
           Coherence               Intrinsic/human      Direct human assessment

Extrinsic evaluations: For TUNA Tasks 2 and 3, we are planning
task-based experiments similar to those in the ASGRE Challenge.
However, we will
present the referring expression and the images of referents
separately in two steps, so that subjects are first shown the RE and
can then bring up the images when they are ready.  We will measure
reading speed in the first step (as an estimation of ease of
comprehension), and identification speed and identification accuracy
(referential clarity) in the second step, for which we will again use
the free high-precision software DMDX and Time-DX (Forster & Forster,
2003).

For the GREC task, where there is no obvious set of distractor
entities, we envisage an experimental scenario in which participants
are asked to decide, for a given NP, whether it refers to the main
subject of the text.  For example, the task may be to decide for all
personal and relative pronouns in a text whether the intended referent
is the main subject of the text, or not.  We will record
identification speed and identification accuracy (referential
clarity).  Subjects will self-pace their reading of the texts, which
will enable us to also measure reading speed (ease of
comprehension). We are currently preparing pilot experiments to test
possible approaches.

Intrinsic automatic evaluations: We will assess humanlikeness, the
similarity of peer outputs to sets of human-produced reference
`outputs', by a range of automatic metrics.  For Task 1, we will again
use the Dice coefficient, and additionally accuracy (the proportion of
exact matches), and a measure called MASI (Measuring Agreement on
Set-valued Items), which is slightly biased in favour of similarity
where one set is a subset of the other (Passonneau, 2006).  For Tasks
2, 3 and 4, we will use accuracy, string-edit distance, BLEU and NIST.
For backwards comparability with the ASGRE Challenge, we will also
again assess the minimality and uniqueness of attribute sets in Task 1.
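
For reference, the set-overlap metrics mentioned above have compact
definitions.  The sketch below is an unofficial illustration of Dice,
MASI and exact-match accuracy over attribute sets; it is not the
scoring code that will be distributed to participants.

  # Illustrative sketch only: set-overlap metrics for comparing a peer
  # attribute set with a human-produced reference set (Task 1).

  def dice(peer, ref):
      """Dice coefficient: 2|A intersect B| / (|A| + |B|)."""
      if not peer and not ref:
          return 1.0
      return 2 * len(peer & ref) / (len(peer) + len(ref))

  def masi(peer, ref):
      """MASI (Passonneau, 2006): Jaccard weighted by monotonicity."""
      if not peer and not ref:
          return 1.0
      jaccard = len(peer & ref) / len(peer | ref)
      if peer == ref:
          m = 1.0
      elif peer <= ref or ref <= peer:   # one set subsumes the other
          m = 2.0 / 3.0
      elif peer & ref:                   # overlap, but neither subsumes
          m = 1.0 / 3.0
      else:                              # disjoint sets
          m = 0.0
      return jaccard * m

  def accuracy(peer, ref):
      """Exact match: 1 if peer and reference sets are identical."""
      return 1.0 if peer == ref else 0.0

  peer = {("type", "fan"), ("colour", "red")}
  ref = {("type", "fan"), ("colour", "red"), ("size", "large")}
  print(dice(peer, ref), masi(peer, ref), accuracy(peer, ref))
  # -> 0.8  0.444...  0.0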

Intrinsic human-assessed evaluations: This is the type of human
evaluation in the MT-Eval and DUC evaluation campaigns. We will train
human evaluators in assessing texts for the two DUC criteria of
referential clarity and coherence and will largely follow DUC
methodology (Trang Dang, 2006).  While the TUNA domain is likely to be
too simple for this type of evaluation to show significant
differences, we will use it on the more varied GREC domain.  These
additional evaluations will also act as a fall-back for a task type
(named entity reference generation) for which there is no prior
evaluation experience to draw upon.

General system evaluation:

Subject to feasibility, (implementations of) the submitted systems may
also be compared in terms of their processing efficiency.


6. Participation
----------------

At this point, we would like anybody who is potentially interested in
participating in the REG Challenge to register via the REG homepage
(http://www.nltg.brighton.ac.uk/research/reg08) by completing and
submitting a preliminary registration form.  Upon preliminary
registration, participants will be sent sample data for the four
shared tasks and detailed task definitions, including input and output
specifications.  A Participant's Pack, with full details and a more
exhaustive description of the datasets and of the input and output
specifications, will subsequently be distributed.


7. Proceedings and Presentations
--------------------------------

The REG Challenge 2008 meeting will be part of INLG'08.  There will be
a special session in the conference programme for an overview of the
participating systems, presentation of evaluation results and open
discussion.  The participating systems will additionally be presented
in the form of 2-page papers in the conference proceedings, and
posters during the INLG'08 poster session.

REG Challenge Papers will not undergo a selection procedure with
multiple reviews, but the organisers reserve the right to reject
material which is not appropriate given the participation guidelines.
Page limits are the same for all tracks: papers should not exceed 2
(two) pages in length, including diagrams and bibliography.  Authors
should follow the INLG'08 style guidelines.


8. Important Dates
------------------

Oct 17, 2007   INLG'08 First Call for papers, including announcement of
               REG Challenge
Jan 03, 2008   REG Challenge 2008 First Call for Participation;
               Preliminary registration open; sample data available
Jan 28, 2008   Release of training and development data sets for all tasks
Mar 17, 2008   Test data becomes available
Mar 17-Apr 07  Test data submission period: participants can download test
               data at any time, but must submit a system report first and
               must submit their outputs within 48 hours
Apr 07, 2008   Final deadline for submission of test data outputs
Apr 07-May 10  Evaluation period
Jun 12, 2008   REG Challenge meeting at INLG'08


9. Organisation
---------------

Anja Belz, NLTG, University of Brighton, UK
Albert Gatt, Computing Science, University of Aberdeen, UK

REG Challenge homepage:  http://itri.brighton.ac.uk/research/reg08
REG Challenge email:     gre-stec -AT- itri.brighton.ac.uk



