[Corpora-List] RTE3 OPTIONAL PILOT TASK: EXTENDING THE EVALUATION OF INFERENCES FROM TEXTS - CALL FOR PARTICIPATION

Danilo Giampiccolo giampiccolo at itc.it
Thu Feb 22 15:24:10 UTC 2007


Apologies for cross-postings.


RTE3 OPTIONAL PILOT TASK - CALL FOR PARTICIPATION
EXTENDING THE EVALUATION OF INFERENCES FROM TEXTS
 (http://www.pascal-network.org/Challenges/RTE3/)


PASCAL RTE has successfully blazed a trail for evaluating the capacity
of systems to automatically infer information from texts. However, it
does not presently address all issues in textual entailment. At least
one new area is already being addressed this year within RTE3: trialing
the use of longer text passages. This optional pilot explores two other
tasks closely related to textual entailment: differentiating unknown
from false/contradicted, and providing justifications for answers. The
pilot will piggyback on the existing RTE3 Challenge infrastructure and
evaluation process by using the same test set as the primary task, but
with a later submission deadline for answers.

The goal of making a three-way decision among "YES", "NO", and
"UNKNOWN" is to drive systems to make more precise informational
distinctions: a hypothesis that is merely unknown on the basis of a
text should be distinguished from a hypothesis that is shown false by,
or contradicts, the text. For example, given the text "John bought a
car", the hypothesis "John bought a red car" is unknown, whereas "John
did not buy a car" is contradicted. The goal of providing
justifications for decisions is to explore how eventual users of tools
incorporating entailment can be helped to understand how a system
reached its decisions. Users are unlikely to trust a system that gives
no explanation for its decisions.

The pilot task seeks participation from all interested parties; we hope
that it will be of interest to many PASCAL RTE participants and that it
can help inform the design of the main task for future RTE Challenges.
The US National Institute of Standards and Technology (NIST) will
perform the evaluation, using human assessors for the inference task.


EXTENDED TASK DESCRIPTION

* Everyone is invited to participate in the extended task.
* Teams participating in the extended task will be asked to treat the
RTE3 test data as blind test data until after they submit to the
extended task.
* Teams participating in the extended task submit a 3-way answer key for
the test set used in the primary task.  
* Optionally, a team can also submit, for each pair, a justification of
how the answer was derived.
* The 3-way answers use the same format as the standard PASCAL
submission, but are unranked and allow three answers: YES, UNKNOWN, and
NO. (A purely hypothetical illustration appears after this list.)
* A justification consists of a set of ASCII strings delimited by
begin/end tags.  The purpose of the justification is to explain to an
ordinary person (i.e., not a linguist or logician) why the given answer
is correct.  For true examples, the justification should indicate the
basis for concluding that the hypothesis is true; otherwise, it should
indicate at least one reason why the hypothesis does not follow from
the text. In either case, a system should provide any background,
lexical, or world knowledge that it uses in addressing a pair, and
should indicate which parts of the text are used to justify, or to
differentiate from, which parts of the hypothesis. The format and
content of justifications are intentionally underspecified, since we
are interested in learning what makes a good justification.
* People may submit up to 2 answer keys to the pilot task. The answers
need not be consistent with their submission to the main RTE task.
* The three-way decisions are made by splitting the "NO" category of the
primary task's gold standard answer key into NO and UNKNOWN categories.
The criterion for "NO" mirrors the standard of proof for PASCAL RTE's
"YES": it is very unlikely that the text and hypothesis could both be
true. NIST will determine a gold standard answer key, score the
submitted runs against it, and make the key available.
* The 3-way answer key is scored using two metrics (on unranked
answers): accuracy, and F(beta=3) of precision and recall on the YES
and NO categories, with the weighting preferring high precision. (This
allows a system that opts for "UNKNOWN" when it is unsure of the answer
to receive reasonable credit; a sketch of such scoring appears after
this list.)
* NIST human assessors will also assign scores for some (relatively
small) subset of the justifications of the test set pairs. The subset to
use will be selected to include YES, NO, and UNKNOWN pairs.  The size of
the subset will be largely determined by how many submissions are
received and how difficult it is to assess the justifications.  
* Justifications will be scored on a 5-point scale for each of
correctness and usability.  'Usability' is whether the assessor can
comprehend the justification.  If (and only if) the justification
receives a high-enough score on the usability component will the
assessor assign a score for correctness.  A system will be marked down
for correctness if it makes inferences that clearly do not follow from
the text and the background knowledge it provided, or if it fails to
draw inferences that were possible.
* It is not possible to construct a gold standard answer key for
justifications.  NIST will compute some (to be determined) aggregate
score for the justification component of submitted runs.
* A report on the extended task will be prepared in time for the RTE-3
workshop (separately, not within the usual ACL proceedings process),
and some time will be made available to discuss it during the workshop.
The timing means that participants will mostly have to write their
system reports prior to the release of results.
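
For concreteness, the following is a purely hypothetical sketch of what
an unranked 3-way submission with an optional justification might look
like. The column layout and the tag name are illustrative assumptions
on our part, not the official format, which is fixed by the guidelines
distributed to participants:

    # hypothetical excerpt: one unranked 3-way answer per test pair
    1   YES
    2   UNKNOWN
    3   NO

    # hypothetical justification for pair 3, delimited by begin/end tags
    <justification pair="3">
    The text states that the deal was cancelled, while the hypothesis
    asserts that it was completed; both cannot be true, so the answer
    is NO.
    </justification>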
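To make the scoring concrete, here is a minimal Python sketch of one
plausible reading of these metrics: accuracy over all three categories,
plus one-vs-rest precision and recall on the YES and NO categories,
combined by an F-beta measure. The function names and toy data are
illustrative only. Note that under the standard definition of F-beta,
beta=3 weights recall more heavily than precision; since the weighting
here is meant to prefer precision, beta may be applied the other way
around (e.g., beta=1/3), and the official guidelines govern.

    def precision_recall(gold, pred, label):
        # One-vs-rest counts for a single category.
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        return prec, rec

    def f_beta(prec, rec, beta):
        # Standard F-beta: (1 + b^2) * P * R / (b^2 * P + R).
        # beta > 1 emphasizes recall; beta < 1 emphasizes precision.
        if prec == 0.0 and rec == 0.0:
            return 0.0
        b2 = beta * beta
        return (1 + b2) * prec * rec / (b2 * prec + rec)

    def score_run(gold, pred, beta=3.0):
        # Accuracy over all pairs; F-beta for YES and NO only. Answering
        # UNKNOWN when unsure avoids a precision-damaging wrong YES or NO.
        scores = {"accuracy": sum(g == p for g, p in zip(gold, pred)) / len(gold)}
        for label in ("YES", "NO"):
            prec, rec = precision_recall(gold, pred, label)
            scores["F_" + label] = f_beta(prec, rec, beta)
        return scores

    gold = ["YES", "NO", "UNKNOWN", "YES", "NO"]
    pred = ["YES", "UNKNOWN", "UNKNOWN", "YES", "NO"]
    print(score_run(gold, pred))

On this toy run, the cautious UNKNOWN on pair 2 lowers recall on the NO
category but leaves its precision intact, which is the behavior the
metric is meant to reward.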

IMPORTANT DATES

* Guidelines distributed: Feb 23, 2007.
* A 3-way answer key for the RTE-3 development data available: Feb 28,
2007.
* Sample justifications (8-10 illustrative examples and how they might
be judged) available: Mar 30, 2007.
* Submissions for the extended task are due April 30, 2007.
* Results for both parts of the extended task returned to participants:
no later than June 7, 2007.

REGISTRATION

For registration, further information and inquiries, please visit the
RTE3 website:
http://www.pascal-network.org/Challenges/RTE3/.

ORGANIZING COMMITTEE

This pilot is being organized by Christopher Manning
<manning at cs.stanford.edu>, Dan Moldovan <moldovan at languagecomputer.com>,
and Ellen Voorhees 
<ellen.voorhees at nist.gov>, with input from the PASCAL RTE Organizers. 

CONTACTS
Please direct any questions to the pilot organizers, putting "[RTE3]" in
the subject header.


