[Corpora-List] Spellchecker evaluation corpus
Martin Reynaert
reynaert at uvt.nl
Mon Apr 11 14:15:22 UTC 2011
Dear list,
We have so far heard interesting news about German and English spelling
correction benchmarks. Thank you Yannick, Eric and John! And, of course,
Stefan for bringing up this remarkable state of affairs!
About two years ago I contacted Dr. A. Zamora to enquire whether the
list of 50,000 English typos and their corrected forms he collected
together with Dr. J.J. Pollock in about 1983 could perhaps be made
available. He informed me that it is 'lost in the mists of time'...
Another missed chance, I presume, was when the BNC was rejuvenated and
`corrected' a couple of years ago. I sent in a list of over 3,000 typos
and won a personal copy of the new XML version. Given that copy and the
original version, it might yet be possible to quickly derive a nice
benchmark for English...
In the context of search-engine query spelling correction, Bing and
Microsoft Research currently have a challenge running (cf.
http://web-ngram.research.microsoft.com/spellerchallenge/). A large
training data set is provided. The test set, however, is not going to be
released. For systems that require training, the MS training data might
be used in 10-fold experiments, with each left-out fold in turn serving
as the test set. This would be another option open to us at this stage.
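For what it is worth, here is a minimal sketch in Python of such a
10-fold setup. The file name and the tab-separated error/correction
layout are my own assumptions, not the actual format of the challenge
data:

    import random

    def ten_fold_splits(pairs, k=10, seed=0):
        # Shuffle once with a fixed seed, cut into k folds, and let each
        # fold serve once as the held-out test set.
        pairs = list(pairs)
        random.Random(seed).shuffle(pairs)
        folds = [pairs[i::k] for i in range(k)]
        for i in range(k):
            test = folds[i]
            train = [p for j, fold in enumerate(folds) if j != i for p in fold]
            yield train, test

    # Hypothetical input: one "misspelling<TAB>correction" pair per line.
    with open("ms_training_pairs.tsv", encoding="utf-8") as fh:
        pairs = [tuple(line.rstrip("\n").split("\t")) for line in fh]

    for fold_no, (train, test) in enumerate(ten_fold_splits(pairs), 1):
        print("fold %d: %d training pairs, %d test pairs"
              % (fold_no, len(train), len(test)))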
Trevor: In a 2006 paper I published statistics about typographical
errors gathered from the Reuters RCV1 corpus for English (12,094 pairs
of errors/attested corrections, over 3,000 of which also occurred in
the BNC) and from a corpus of Dutch newspapers (9,152 pairs). Both lists
have grown considerably since then, as I have intermittently put more
effort into them. Due to IPR issues, however, these lists are not
publicly available to date. More about this later.
John's example, taken from the newer version of what used to be more
commonly known as the Birkbeck spelling error corpus compiled by Dr.
Mitton, shows that it is geared far more towards cognitive errors than
towards typographical errors. In fact, this particular pair constitutes
what are called 'confusables', a.k.a. real-word errors. I strongly agree
with John that we need several kinds of benchmark sets, and I have
written about that in an LREC paper in 2008 (available from
http://ilk.uvt.nl/publications ).
Another resource that is sometimes used in evaluating spelling
correction systems is the list provided by Kevin Atkinson, the maker of
Aspell. These are isolated errors coupled to their alleged corrections.
I have strong doubts about some of these pairings, e.g. *amification
corrected as amplification (source:
http://aspell.net/test/cur/batch0.tab). The point is that, certainly
for larger edit or Levenshtein distances between a non-word and its
correct form (distance 2 in this particular example), one needs the
context the error appeared in. This is best illustrated by an example
from my PhD work (available from http://ilk.uvt.nl/~mre/), where the
non-word *onjections might have to be resolved to either 'injections'
or 'objections'.
(This is a laboratory sentence:)
Her vehement *onjections to these painful *onjections were based on
solid medical evidence, as well as a hearty dislike of needles.
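A quick dynamic-programming Levenshtein sketch (my own illustration,
nothing more) makes the problem concrete: *amification is at distance 2
from its listed correction, while *onjections is at distance 1 from both
of its plausible corrections, so distance alone cannot decide between
them:

    def levenshtein(a, b):
        # Standard edit distance over insertions, deletions and
        # substitutions, keeping only the previous matrix row.
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                curr.append(min(prev[j] + 1,                 # deletion
                                curr[j - 1] + 1,             # insertion
                                prev[j - 1] + (ca != cb)))   # substitution
            prev = curr
        return prev[-1]

    print(levenshtein("amification", "amplification"))  # 2
    print(levenshtein("onjections", "injections"))      # 1
    print(levenshtein("onjections", "objections"))      # 1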
For English, I intend someday, perhaps soon, to initiate the necessary
negotiations with LDC to find a solution for my RCV1 error list...
For Dutch, we at ILK are working on a large benchmark of spelling (and
other lexical) errors, based on a selection of texts (up to book length,
from a variety of text types). IPR issues for these texts have all been
settled in the framework of SoNaR, the Reference corpus of contemporary
written Dutch we are currently building. We will notify the list as soon
as this benchmark is available.
In part to help facilitate building this benchmark, we are also
currently proposing a new XML format called FoLiA (Format for
Linguistic Annotation). More at: http://ilk.uvt.nl/folia
To conclude, I would like to repeat here what I have been proposing
elsewhere: we do indeed need shareable benchmark sets for a range of
languages, but we also need to work towards a consensus regarding the
actual evaluation metrics we (should) use.
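To make the metrics question concrete, here is one purely illustrative
way of scoring a correction system, keeping error detection and error
correction apart. The token positions and dictionaries below are
invented for the example and do not reflect any existing benchmark:

    def score(system, gold):
        # system/gold map token position -> proposed/true correction.
        detected = set(system) & set(gold)   # flagged a genuine error
        corrected = {p for p in detected if system[p] == gold[p]}
        det_p = len(detected) / len(system) if system else 0.0
        det_r = len(detected) / len(gold) if gold else 0.0
        cor_p = len(corrected) / len(system) if system else 0.0
        cor_r = len(corrected) / len(gold) if gold else 0.0
        return det_p, det_r, cor_p, cor_r

    # Two gold errors; the system flags two positions, one of them a
    # false positive, and corrects the genuine one properly.
    gold = {3: "injections", 7: "objections"}
    system = {3: "injections", 5: "the"}
    print(score(system, gold))  # (0.5, 0.5, 0.5, 0.5)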
I would be interested in proposals for collaboration towards building
benchmark sets for more languages.
Martin Reynaert
ILK
TiCC
Tilburg University
The Netherlands