19.296, FYI: REG Challenge 2008: First Call for Participation
LINGUIST Network
linguist at LINGUISTLIST.ORG
Fri Jan 25 19:38:47 UTC 2008
LINGUIST List: Vol-19-296. Fri Jan 25 2008. ISSN: 1068 - 4875.
Subject: 19.296, FYI: REG Challenge 2008: First Call for Participation
Moderators: Anthony Aristar, Eastern Michigan U <aristar at linguistlist.org>
Helen Aristar-Dry, Eastern Michigan U <hdry at linguistlist.org>
Reviews: Randall Eggert, U of Utah
<reviews at linguistlist.org>
Homepage: http://linguistlist.org/
The LINGUIST List is funded by Eastern Michigan University,
and donations from subscribers and publishers.
Editor for this issue: Matthew Lahrman <matt at linguistlist.org>
================================================================
===========================Directory==============================
1)
Date: 25-Jan-2008
From: Anja Belz < A.S.Belz at brighton.ac.uk >
Subject: REG Challenge 2008: First Call for Participation
-------------------------Message 1 ----------------------------------
Date: Fri, 25 Jan 2008 14:37:26
From: Anja Belz [A.S.Belz at brighton.ac.uk]
Subject: REG Challenge 2008: First Call for Participation
First Call For Participation
Referring Expression Generation Challenge 2008:
To be held in conjunction with the 5th International Natural Language
Generation Conference (INLG 2008), June 12-14, 2008, Salt Fork, Ohio, USA
Following the success of the Pilot NLG Challenge on Attribute
Selection for Generating Referring Expressions (ASGRE) in September 2007,
we are organising a second NLG Challenge, the Referring Expression
Generation Challenge (REG 2008), to be presented and discussed during a
special session at INLG 2008. While the ASGRE Challenge focused on
attribute selection for definite references, the REG Challenge expands the
scope of the original to include both attribute selection and realisation,
and introduces a new task involving references to named entities in context.
The REG Challenge has eight submission tracks and two different data sets.
It maintains the ASGRE Challenge's emphasis on openness to alternative
task definitions and evaluation methods, and will involve both automatic
and task-based evaluation.
Contents of this Call:
1. Background
2. Generation of Referring Expressions
3. REG Challenge Data Sets
4. REG Challenge Tasks and Submission Tracks
5. Evaluation
6. Participation
7. Proceedings and Presentations
8. Important Dates
9. Organisation
1. Background
Over the past few years, the need for comparative, quantitatively evaluated
results has been increasingly felt in the field of NLG. Following a number
of discussion sessions at NLG meetings, a workshop dedicated to the topic
was held with NSF support in Arlington, Va., US, in April 2007 (see
http://www.ling.ohio-state.edu/~mwhite/nlgeval07/). At this workshop, a
decision was taken to organise a pilot shared task evaluation challenge,
focussing on the area of GRE because of the broad consensus that has arisen
among researchers on the nature and scope of this problem. The First NLG
Challenge on Attribute Selection for Generating Referring Expressions
(ASGRE) was held in Copenhagen in September 2007 in conjunction with the
UCNLG+MT Workshop. It was a successful pilot, both in terms of
participation and in the variety and quality of submissions received (see
http://www.csd.abdn.ac.uk/~agatt/home/pubs/asgre2007.pdf). With 18 initial
registrations, and final submissions from six teams comprising 13
researchers and covering 22 different systems, community interest was
substantial.
Several aspects of the ASGRE Challenge were intended to promote an approach
to shared-task evaluation where community interests feed directly into the
nature and evaluation of tasks. This is important in order to counteract a
potential narrowing of scope, where a shared task, rather than reflecting
community interests, plays a causal role in shaping them. The most
important of these aspects were:
* A wide range of evaluation criteria, involving both automatic and
task-based, intrinsic and extrinsic methods.
* An Open Category Track which enabled researchers to submit reports
describing novel approaches involving the shared dataset, while opting out
of the competitive element.
* An Evaluation Methods Track for submissions with novel proposals for
evaluation of the shared task.
* Self-evaluation: participants computed scores for the development data
set, using code supplied by the organisers.
2. Generation of Referring Expressions (GRE):
Since the foundational work of authors such as Appelt, Kronfeld, Grosz,
Joshi, Dale and Reiter, GRE has been the subject of intensive research in
the NLG community, giving rise to significant consensus on the GRE problem
definition, as well as the nature of the inputs and outputs of GRE
algorithms. This is particularly true of the subtask of attribute
selection for definite referring expressions (REs),
perhaps the most widely researched NLG subtask. A succinct definition of
the attribute selection task is given by Bohnet and Dale (2005):
''Given a symbol corresponding to an intended referent, how do we work out
the semantic content of a referring expression that uniquely identifies the
entity in question?''
This was precisely the task definition in the ASGRE Challenge. The REG
Challenge adds tasks on realisation of definite REs and choice of type of
RE in discourse context, aiming to cover a larger subset of NLG research
interests.
3. REG Challenge Data--TUNA Corpus of Referring Expressions:
The TUNA Corpus consists of a set of human-produced descriptions of objects
in a visual domain of pictures of furniture or people, annotated at the
semantic level with a domain representation. It was collected during an
elicitation experiment, in which one between-subjects condition controlled
the use of the location of an object in descriptions (+/-Location). The
version of the TUNA data to be used in the REG Challenge will, in addition
to attribute sets, include human-produced descriptions, the corresponding
pictures of domain objects, and the applicable experimental condition.
While ASGRE participants were never shown the test set outputs, we feel it
is more appropriate to use a new test set with unseen inputs as well as
unseen outputs. We are therefore creating 50 new corpus items for each
subdomain (people and furniture), in experimental conditions that replicate
the original TUNA elicitation experiments. This time we are obtaining
three descriptions for each test item, which will allow us to compare peer
outputs to several human-produced descriptions, resulting in a more
reliable assessment of humanlikeness.
The TUNA Corpus is described in greater detail here: Gatt, A., van der
Sluis, I., and van Deemter, K. (2007). Evaluating algorithms for the
generation of referring expressions using a balanced corpus. Proceedings of
the 11th European Workshop on Natural Language Generation, ENLG-07.
See also http://www.csd.abdn.ac.uk/research/evaluation for details of the
corpus.
GREC Corpus of named entity references in context:
The GREC corpus consists of just over 2,000 short introductory texts from
Wikipedia entries containing about 18,000 annotated referring expressions
in total. The annotated references in each text are to a single entity
which constitutes the main subject or topic of the text. The texts fall
into five domains: cities, countries, rivers, people and mountains. A
subset of the corpus (100 texts), which will serve as the test set for the
tasks involving GREC, contains three additional referring expressions for
each reference in the original texts, as selected by human participants
during an experiment.
We are preparing a second test set, also containing multiple
human-selected alternatives for each reference. The domain of this test
set will be different from the five contained in the corpus, and will not
be revealed to participants in advance.
Our use of this corpus represents an effort to extend the scope of the REG
Challenge to take into account the effect of discourse context on the form
that a referring expression should take.
An earlier version of the GREC corpus (containing just over 1,000 texts) is
described in greater detail here: Belz, A. and Varges, S. (2007).
Generation of Repeated References to Discourse Entities. Proceedings of
the 11th European Workshop on Natural Language Generation, ENLG-07.
4. REG Challenge Tasks and Submission Tracks--Summary of submission tracks:
1. Task 1 (TUNA-AS): Attribute selection for referring expressions.
2. Task 2 (TUNA-R): Realisation of referring expressions.
3. Task 3 (TUNA-REG): Attribute selection and realisation combined.
4. TUNA Open Track: Any work involving the TUNA data.
5. TUNA Evaluation Methods: Any work involving evaluation of Tasks 1-3.
6. Task 4 (GREC): Named entity reference generation: given a referent, a
discourse context and a list of possible referring expressions, select the
referring expression most appropriate in the context.
7. GREC Open Track: Any work involving the GREC data.
8. GREC Evaluation Methods: Any work involving evaluation of Task 4.
Open Tracks and Evaluation Methods Tracks:
The open tracks act to prevent an overly narrow task definition from
dominating in an otherwise varied research field, and the evaluation
methods tracks allow researchers to contribute evaluation methods that they
consider most appropriate. The idea is that such alternative tasks and
evaluation methods will become part of future evaluation events (e.g. the
MASI metric included this year was proposed by Advaith Siddharthan,
Cambridge University, last year).
Task 1 (TUNA-AS):
This is the same task as in the ASGRE Challenge (i.e. mapping from domain
representations to attribute sets), but with a previously unseen test data
set which will have multiple reference descriptions. The inclusion of this
task will allow participants to try to improve over the 2007 systems
(a ''Progress Test'' in the sense of the NIST-led MT evaluations), and
will allow researchers who were unable to take part in 2007 to participate.
See also http://www.csd.abdn.ac.uk/research/evaluation for details of data
and task definition used in the ASGRE Challenge.
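By way of illustration only, here is a minimal sketch of one well-known
attribute-selection strategy, a simplified variant of the Dale and Reiter
incremental algorithm; the domain representation, the attribute names and
the preference order are invented for the example and do not reflect the
official TUNA-AS input format or any required approach.

# Simplified incremental attribute selection (after Dale & Reiter).
# The domain objects and preference order below are invented for this
# illustration; the official TUNA-AS data format differs.

def select_attributes(target, distractors, preference_order):
    """Add attributes of `target` until no distractor matches them all."""
    selected = {}
    remaining = list(distractors)
    for attr in preference_order:
        if attr not in target:
            continue
        value = target[attr]
        # Keep the attribute only if it rules out at least one distractor.
        still_matching = [d for d in remaining if d.get(attr) == value]
        if len(still_matching) < len(remaining):
            selected[attr] = value
            remaining = still_matching
        if not remaining:  # the target is now uniquely identified
            break
    return selected

# Toy furniture domain: one target object and two distractors.
target = {"type": "fan", "colour": "red", "size": "large"}
distractors = [
    {"type": "fan", "colour": "blue", "size": "large"},
    {"type": "desk", "colour": "red", "size": "small"},
]
print(select_attributes(target, distractors, ["type", "colour", "size"]))
# -> {'type': 'fan', 'colour': 'red'}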
Task 2 (TUNA-R):
In Task 2, participants need to create systems that map sets of
attribute-value pairs to natural language descriptions. For example, {
type:fan, colour:red, size:large } could map to ''the large red fan''
(among other possibilities). Participants can choose either to maximise
similarity with the human-produced descriptions in the corpus, or to
optimise for human comprehension and fast identification of referents
(see also the evaluation section below).
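Purely to illustrate the input/output contract (and not the template
realiser that will be distributed), the realisation step might be sketched
as follows; the modifier ordering and the attribute names are assumptions
made for this example.

# Illustrative template realisation of an attribute set such as
# { type:fan, colour:red, size:large }. The "the <size> <colour> <type>"
# pattern is an assumption made for this sketch.

def realise(attrs):
    parts = ["the"]
    for slot in ("size", "colour"):      # prenominal modifiers, in order
        if slot in attrs:
            parts.append(str(attrs[slot]))
    parts.append(str(attrs.get("type", "object")))  # head noun
    return " ".join(parts)

print(realise({"type": "fan", "colour": "red", "size": "large"}))
# -> "the large red fan"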
Task 3 (TUNA-REG):
Task 3 combines Tasks 1 and 2, i.e. participating systems need to map from
a domain representation to a natural language description that describes
the target referent. Again, participants can choose which of the
evaluation criteria to optimise for.
One important aspect of the TUNA tasks is that the template realiser and
some of the participant attribute-selection systems from the ASGRE
Challenge will be made available to REG Challenge participants. Thus,
teams participating in these tasks can choose to focus on just one of the
components (mapping from input domains to attribute sets or mapping from
attribute sets to realisations). This is especially relevant to
participants in Task 3, where a team might choose to invest more effort on
one of the two components, while using an off-the-shelf system for the
other. Such reuse of components has been found useful in other evaluation
initiatives such as the TC-STAR text-to-speech evaluation initiative.
Task 4 (GREC):
In the shared-task version of the GREC corpus, every main subject reference
(MSR) is annotated with a set of possible MSRs. This set is automatically
generated by collecting all MSRs that occur in the same text, and applying
some generic rules that add a range of default options. The set of
possible MSRs along with the surrounding text (including annotations) forms
the input, and a single MSR selected for
a given slot forms the output. Participating systems need to
implement this mapping. This is a (simplified) example of an input, where
the reference output in the corpus is ''Brunei'':
<TEXT ID=''1''>
<PARAGRAPH>
...
<REF ID=''1.2'' TYPE=''np-subj'' CAT=''country'' ALT=''Brunei;
Brunei itself; _; it; it itself; that; that itself; the country; the
country itself; which; which itself''>Brunei</REF>, the remnant of a very
powerful sultanate, became independent from Great Britain in 1984.
</PARAGRAPH>
<PARAGRAPH>
...
</PARAGRAPH>
</TEXT>
The GREC task is to decide, given a set of alternative referring
expressions, as well as the context of the reference, which of the
alternatives is the most appropriate. The main focus in the Humanlikeness
evaluation of this task (see following section) will be on the relatively
coarse-grained choice between (a) common-noun references (e.g. ''The
country''); (b) proper-name references (e.g. ''Brunei''); and (c)
pronominal references (e.g. ''it''). Under this
conception, a system is considered to have made a correct choice relative
to a human if it selects a reference of the same type as the human.
Secondarily, a more fine-grained evaluation will be carried out, in which a
system's actual REs (rather than their types) will be assessed.
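To make the input/output contract concrete, the sketch below takes the
list of alternatives from a REF element's ALT attribute (as in the example
above), together with the text preceding the reference slot, and selects
one referring expression using a deliberately crude heuristic (a pronoun
for subsequent mentions, the proper name otherwise); the heuristic and all
names in the code are invented for illustration and are not a proposed
baseline for the task.

# Illustrative-only selection of a main subject reference (MSR) from a
# list of alternatives, given the preceding text. The pronoun-vs-name
# heuristic is invented for this sketch.

def choose_reference(alternatives, preceding_text, entity_name):
    pronouns = {"it", "he", "she", "they", "which", "that"}
    already_mentioned = entity_name.lower() in preceding_text.lower()
    if already_mentioned:
        # Subsequent mention: prefer a pronominal reference if available.
        for alt in alternatives:
            if alt.lower() in pronouns:
                return alt
    # First mention (or no pronoun available): prefer the proper name.
    for alt in alternatives:
        if alt == entity_name:
            return alt
    return alternatives[0]

alts = ["Brunei", "Brunei itself", "_", "it", "it itself", "that",
        "that itself", "the country", "the country itself",
        "which", "which itself"]

print(choose_reference(alts, "", "Brunei"))
# -> "Brunei"   (first mention in the paragraph)
print(choose_reference(alts, "Brunei became independent in 1984.", "Brunei"))
# -> "it"       (subsequent mention)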
5. Evaluation:
All data sets will be divided into training, development and test data.
Participants will compute evaluation scores on the development set (using
code provided by us), and the organisers will perform evaluations on the
test data set. We will again use a range of different evaluation methods,
including intrinsic and extrinsic, automatically assessed and
human-evaluated, as shown in the overview below. Intrinsic evaluations
assess properties of peer systems in their own right, whereas extrinsic
evaluations assess the effect of a
peer system on something that is external to it, such as its effect on
human performance at a given task or the added value it brings to an
application.
Task(s)    Criteria                 Type of evaluation    Evaluation methods
TUNA-AS    Humanlikeness            Intrinsic/automatic   Accuracy, Dice, MASI
           Minimality, Uniqueness   Intrinsic/automatic   Proportion of minimal/unique outputs
TUNA-R     Humanlikeness            Intrinsic/automatic   Accuracy, BLEU, NIST, string-edit distance
TUNA-REG   Ease of comprehension    Extrinsic/human       Self-paced reading in identification experiment
           Referential Clarity      Extrinsic/human       Speed and accuracy in identification experiment
GREC       Humanlikeness            Intrinsic/automatic   Accuracy, BLEU, NIST, string-edit distance
           Ease of comprehension    Extrinsic/human       Self-paced reading in identification experiment
           Referential Clarity      Extrinsic/human       Speed and accuracy in identification experiment
           Referential Clarity      Intrinsic/human       Direct human assessment
           Coherence                Intrinsic/human       Direct human assessment
Extrinsic evaluations: For TUNA Tasks 2 and 3, we are planning task-based
experiments similar to those in the ASGRE Challenge. However, we will
present the referring expression and the images of referents separately in
two steps, so that subjects are first shown the RE and can then bring up
the images when they are ready. We will measure reading speed in the first
step (as an estimate of ease of comprehension), and identification speed
and identification accuracy (referential clarity) in the second step, for
which we will again use the free high-precision software DMDX and Time-DX
(Forster & Forster, 2003).
For the GREC task, where there is no obvious set of distractor
entities, we envisage an experimental scenario in which participants are
asked to decide, for a given NP, whether it refers to the main subject of
the text. For example, the task may be to decide for all personal and
relative pronouns in a text whether the intended referent is the main
subject of the text, or not. We will record identification speed and
identification accuracy (referential clarity). Subjects will self-pace
their reading of the texts, which will enable us to also measure reading
speed (ease of comprehension). We are currently preparing pilot experiments
to test possible approaches.
Intrinsic automatic evaluations: We will assess humanlikeness, the
similarity of peer outputs to sets of human-produced reference ''outputs'',
using a range of automatic metrics. For Task 1, we will again use the Dice
coefficient, and additionally accuracy (the proportion of exact matches)
and a measure called MASI (Measuring Agreement on Set-valued Items), which
is slightly biased in favour of similarity where one set is a subset of the
other (Passonneau, 2006). For Tasks 2, 3 and 4, we will use accuracy,
string-edit distance, BLEU and NIST.
For backwards comparability with the ASGRE Challenge, we will also again
assess minimality and uniqueness of attribute sets in Task 1.
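For concreteness, the two set-comparison metrics can be sketched as
follows; this is a minimal illustration of the Dice and MASI definitions,
not the evaluation code that the organisers will supply.

# Minimal illustration of the Task 1 set-comparison metrics, with attribute
# sets represented as Python sets of (attribute, value) pairs.

def dice(a, b):
    """Dice coefficient: 2|A & B| / (|A| + |B|)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return 2.0 * len(a & b) / (len(a) + len(b))

def masi(a, b):
    """MASI (Passonneau, 2006): Jaccard weighted by a monotonicity factor
    that favours the case where one set is a subset of the other."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    jaccard = len(a & b) / float(len(a | b))
    if a == b:
        m = 1.0
    elif a <= b or b <= a:
        m = 2.0 / 3.0
    elif a & b:
        m = 1.0 / 3.0
    else:
        m = 0.0
    return jaccard * m

peer = {("type", "fan"), ("colour", "red")}
reference = {("type", "fan"), ("colour", "red"), ("size", "large")}
print(round(dice(peer, reference), 3))   # -> 0.8
print(round(masi(peer, reference), 3))   # -> 0.444  (= 2/3 * 2/3)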
Intrinsic human-assessed evaluations: This is the type of human evaluation
used in the MT-Eval and DUC evaluation campaigns. We will train human
evaluators in assessing texts for the two DUC criteria of referential
clarity and coherence, and will largely follow DUC methodology (Trang Dang,
2006). While the TUNA domain is likely to be too simple for this type of
evaluation to show significant differences, we will use it on the more
varied GREC domain. These additional evaluations will also act as a
fall-back for a task type (named entity reference generation) for which
there is no prior evaluation experience to draw upon.
General system evaluation:
Subject to feasibility, (implementations of) the submitted systems may also
be compared in terms of their processing efficiency.
6. Participation:
At this point we would like anybody who is potentially interested in
participating in the REG Challenge to register via the REG homepage
(http://www.nltg.brighton.ac.uk/research/reg08) by completing and
submitting a preliminary registration form. Upon preliminary registration,
participants will be sent sample data for the four shared tasks and
detailed task definitions, including input and output specifications. A
Participant's Pack with full details, including a more exhaustive
description of the data sets and of the input and output specifications,
will subsequently be distributed.
7. Proceedings and Presentations:
The REG Challenge 2008 meeting will be part of INLG'08. There will be a
special session in the conference programme for an overview of the
participating systems, presentation of evaluation results and open
discussion. The participating systems will additionally be presented in
the form of 2-page papers in the conference proceedings, and posters during
the INLG'08 poster session.
REG Challenge Papers will not undergo a selection procedure with multiple
reviews, but the organisers reserve the right to reject material which is
not appropriate given the participation guidelines. Page limits are the
same for all tracks: papers should not exceed 2 (two) pages in length,
including diagrams and bibliography. Authors should follow the INLG'08
style guidelines.
8. Important Dates:
Oct 17, 2007   INLG'08 First Call for Papers, including announcement of the
               REG Challenge
Jan 03, 2008   REG Challenge 2008 First Call for Participation; preliminary
               registration open; sample data available
Jan 28, 2008   Release of training and development data sets for all tasks
Mar 17, 2008   Test data becomes available
Mar 17-Apr 07  Test data submission period: participants can download the
               test data at any time, but must submit their system report
               first and must submit outputs within 48 hours
Apr 07, 2008   Final deadline for submission of test data outputs
Apr 07-May 10  Evaluation period
Jun 12, 2008   REG Challenge meeting at INLG'08
9. Organisation:
Anja Belz, NLTG, University of Brighton, UK
Albert Gatt, Computing Science, University of Aberdeen, UK
REG Challenge homepage: http://www.nltg.brighton.ac.uk/research/reg08
REG Challenge email: gre-stec -AT- itri.brighton.ac.uk
Linguistic Field(s): Computational Linguistics
-----------------------------------------------------------
LINGUIST List: Vol-19-296