[Corpora-List] HOO2012 at BEA7: Preliminary Call for Participation --- Preposition and Determiner Error Correction

Robert Dale robert.dale at mq.edu.au
Fri Dec 23 00:39:16 UTC 2011


PRELIMINARY CALL FOR PARTICIPATION

CONTEXT

The HOO (Helping Our Own) Exercise [Dale and Kilgarriff 2010] is concerned
with correcting textual errors. The HOO Pilot Shared Task run in 2011 (see
[Dale and Kilgarriff 2011]) looked at a diverse range of error types in a
small set of documents. HOO 2012, which will be hosted by the Building
Educational Applications Workshop (see
http://www.cs.rochester.edu/~tetreaul/naacl-bea7.html) at NAACL 2012,
focusses on the correction of preposition and determiner errors in a large
collection of non-native speaker texts. Such errors (for example,
*interested on music* for *interested in music*, or the missing article in
*I had wonderful time*) are widely recognized to be amongst the most
challenging aspects of English lexico-syntax for non-native speakers: see
[Leacock et al. 2010] for a review.

THE TASK

The goal of this task is to provide a forum for the comparative evaluation
of approaches to the correction of errors in the use of prepositions and
determiners by non-native speakers of English. Although these error types
have already been the focus of a considerable body of research in natural
language processing, results from different teams have so far been hard to
compare because of differing data sets and slightly different task
descriptions. This shared task provides a common dataset and a shared
evaluation framework as a means of overcoming these problems.

The HOO-2012 Prepositions and Determiners Shared Task follows on from the
HOO-2011 Shared Task Pilot Round held in 2011 as part of the 2011 European
Natural Language Generation Workshop. That task had a much broader focus on
all kinds of errors in non-native speaker writing, and used a much smaller
dataset (see http://www.clt.mq.edu.au/research/projects/hoo/). The
evaluation framework for HOO-2012 is an enhancement of the scheme developed
for HOO-2011, taking advantage of what was learned in that exercise.

THE DATA

The data to be used for the task is drawn from the Cambridge Learner Corpus
(CLC), and contains exam scripts written by students undertaking the First
Certificate in English (FCE) exams; it is used with the kind permission of
Cambridge University Press.

The data we are using has been converted from the mark-up provided in the
released version of the CLC FCE data to use the HOO annotation scheme. The
data to be released for training consists of 1000 exam scripts extracted
from the FCE dataset. A further subset of 100 exam scripts will be released
for testing and evaluation at the appropriate point in the schedule. We are
endeavouring to obtain fresh data for this stage of the exercise, but in the
event that this turns out not to be possible, we will use part of the
published FCE dataset that was held back from the training data. For this
reason, we would ask participants not to use the originally released FCE
dataset for training purposes, but only the HOO-formatted subset that we
release independently.

EVALUATION

The evaluation methodology is essentially the same as that used in the HOO
Pilot Round, but limited to preposition and determiner errors. Tools will be
provided that compute scores for detection (lenient recognition, requiring
at least one character of overlap with a gold-standard error), recognition
(exact-extent identification of an error), and correction (provision of an
appropriate replacement string in addition to exact recognition).
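
To make the three scoring levels concrete, the following minimal Python
sketch shows how they might be computed, assuming each edit is represented
as a (start, end, correction) character-span triple. This representation
and these names are illustrative assumptions only, not the official HOO
evaluation tools, which will be distributed to participants.

    def overlaps(a, b):
        # Lenient match: the two spans share at least one character.
        return a[0] < b[1] and b[0] < a[1]

    def score(gold, system, match):
        # Precision, recall and F-score for a given match predicate.
        tp_sys = sum(1 for s in system if any(match(g, s) for g in gold))
        tp_gold = sum(1 for g in gold if any(match(g, s) for s in system))
        p = tp_sys / len(system) if system else 0.0
        r = tp_gold / len(gold) if gold else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        return p, r, f

    # Detection: lenient span overlap is enough.
    detection = lambda g, s: overlaps(g, s)
    # Recognition: the extent must match exactly.
    recognition = lambda g, s: g[:2] == s[:2]
    # Correction: exact extent plus the right replacement string.
    correction = lambda g, s: g[:2] == s[:2] and g[2] == s[2]

    gold = [(10, 13, "the"), (42, 44, "on")]
    system = [(10, 13, "a"), (41, 44, "on")]
    print(score(gold, system, detection))    # both edits detected leniently
    print(score(gold, system, recognition))  # only the first matches exactly
    print(score(gold, system, correction))   # right extent, wrong string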

WHAT YOU SHOULD DO NOW

Register your interest by sending your email address to
Robert.Dale at mq.edu.au. You'll be added to the HOO Google Groups list, where
over the next few weeks some fine-tuning of the shared task will be
discussed. Formal registration for the task will be as indicated in the
schedule below.

SCHEDULE

The current schedule for HOO-2012 is as follows.

    Friday 27th January: Website registration for participation in HOO-2012
opens; development data for the Shared Task released.
    Friday 6th April: Test data for evaluation released.
    Friday 13th April: Deadline for submissions from teams for evaluation.
    Monday 23rd April: Results of evaluation released.
    Friday 4th May: Final versions of team reports for proceedings due.

ORGANISERS

Robert Dale, Macquarie University

REFERENCES

R. Dale and A. Kilgarriff [2010] Helping Our Own: Text massaging for
computational linguistics as a new shared task. In Proceedings of the 6th
International Natural Language Generation Conference, pages 261-265, Dublin,
Ireland, 7th-9th July 2010.

R. Dale and A. Kilgarriff [2011] Helping Our Own: The HOO 2011 Pilot Shared
Task. In Proceedings of the 13th European Workshop on Natural Language
Generation, Nancy, France, 28th-30th September 2011.

C. Leacock, M. Chodorow, M. Gamon, and J. Tetreault [2010] Automated
Grammatical Error Detection for Language Learners. Synthesis Lectures on
Human Language Technologies. Morgan and Claypool.

