[Corpora-List] Spellchecker evaluation corpus
Martin Reynaert
reynaert at uvt.nl
Mon Apr 11 14:15:22 UTC 2011
Dear list,
We have so far heard interesting news about German and English spelling
correction benchmarks. Thank you Yannick, Eric and John! And, of course,
Stefan for bringing up this remarkable state of affairs!
About two years ago I contacted Dr. A. Zamora to enquire whether the
list of 50,000 English typos and their corrected forms he collected
together with Dr. J.J. Pollock in about 1983 could perhaps be made
available. He informed me that it is 'lost in the mists of time'...
Another missed chance, I presume, was when the BNC was rejuvenated and
`corrected' a couple of years ago. I sent in a list of over 3,000 typos
and won a personal copy of the new XML version. Given that copy and the
original version, it might yet be possible to quickly derive a nice
benchmark for English...
In the context of search-engine query spelling correction, Bing and
Microsoft Research currently have a challenge running (cf.
http://web-ngram.research.microsoft.com/spellerchallenge/). A large
training data set is provided. The test set, however, is not going to be
released. For systems that require training, the MS training data might
be used in 10-fold experiments, with each left-out fold in turn serving
as the test set. This would be another option open to us at this stage.
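For what it is worth, here is a minimal sketch in Python of such a
10-fold setup. The file name and the tab-separated error/correction
layout are my own assumptions, not the actual format of the challenge
data:

    import random

    def ten_fold_splits(pairs, k=10, seed=0):
        # Shuffle once with a fixed seed, cut into k folds, and let each
        # fold serve once as the held-out test set.
        pairs = list(pairs)
        random.Random(seed).shuffle(pairs)
        folds = [pairs[i::k] for i in range(k)]
        for i in range(k):
            test = folds[i]
            train = [p for j, fold in enumerate(folds) if j != i for p in fold]
            yield train, test

    # Hypothetical input: one "misspelling<TAB>correction" pair per line.
    with open("ms_training_pairs.tsv", encoding="utf-8") as fh:
        pairs = [tuple(line.rstrip("\n").split("\t")) for line in fh]

    for fold_no, (train, test) in enumerate(ten_fold_splits(pairs), 1):
        print("fold %d: %d training pairs, %d test pairs"
              % (fold_no, len(train), len(test)))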
Trevor: In a 2006 paper I published statistics about typographical
errors gathered from the Reuters RCV1 corpus for English (12,094 pairs
of errors/attested corrections, over 3,000 of which also occurred in
the BNC) and from a corpus of Dutch newspapers (9,152 pairs). Both lists
have grown considerably since then, as I have intermittently put more
effort into them. Due to IPR issues, however, these lists are not
publicly available to date. More about this later.
John's example, taken from the newer version of what used to be more
commonly known as the Birkbeck spelling error corpus compiled by Dr.
Mitton, shows that it is geared far more towards cognitive errors than
towards typographical errors. In fact, this particular pair constitutes
what are called 'confusables', a.k.a. real-word errors. I strongly agree
with John that we need several kinds of benchmark sets, and I have
written about that in an LREC paper in 2008 (available from
http://ilk.uvt.nl/publications ).
Another resource that is sometimes used in evaluating spelling
correction systems is the list provided by Kevin Atkinson, the maker of
Aspell. These are isolated errors coupled to their alleged corrections.
I have strong doubts about some of these pairings, e.g. *amification
corrected as amplification (source:
http://aspell.net/test/cur/batch0.tab). The point is that, certainly
for larger edit or Levenshtein distances between a non-word and its
correct form (distance 2 in this particular example), one needs the
context the error appeared in. This is best illustrated by an example
from my PhD work (available from http://ilk.uvt.nl/~mre/), where the
non-word *onjections might have to be resolved to either 'injections'
or 'objections'.
(This is a laboratory sentence:)
Her vehement *onjections to these painful *onjections were based on
solid medical evidence, as well as a hearty dislike of needles.
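A quick dynamic-programming Levenshtein sketch (my own illustration,
nothing more) makes the problem concrete: *amification is at distance 2
from its listed correction, while *onjections is at distance 1 from both
of its plausible corrections, so distance alone cannot decide between
them:

    def levenshtein(a, b):
        # Standard edit distance over insertions, deletions and
        # substitutions, keeping only the previous matrix row.
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                curr.append(min(prev[j] + 1,                 # deletion
                                curr[j - 1] + 1,             # insertion
                                prev[j - 1] + (ca != cb)))   # substitution
            prev = curr
        return prev[-1]

    print(levenshtein("amification", "amplification"))  # 2
    print(levenshtein("onjections", "injections"))      # 1
    print(levenshtein("onjections", "objections"))      # 1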
For English, I intend someday, perhaps soon, to initiate the necessary
negotiations with LDC to find a solution for my RCV1 error list...
For Dutch, we at ILK are working on a large benchmark of spelling (and
other lexical) errors, based on a selection of texts (up to book length,
from a variety of text types). IPR issues for these texts have all been
settled in the framework of SoNaR, the Reference corpus of contemporary
written Dutch we are currently building. We will notify the list as soon
as this benchmark is available.
In part to help facilitate building this benchmark, we are also
currently proposing a new XML format called FoLiA (Format for
Linguistic Annotation). More at: http://ilk.uvt.nl/folia
To conclude, I would like to repeat here what I have been proposing
elsewhere: we do indeed need shareable benchmark sets for a range of
languages, but we also need to work towards a consensus regarding the
actual evaluation metrics we (should) use.
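To make the metrics question concrete, here is one purely illustrative
way of scoring a correction system, keeping error detection and error
correction apart. The token positions and dictionaries below are
invented for the example and do not reflect any existing benchmark:

    def score(system, gold):
        # system/gold map token position -> proposed/true correction.
        detected = set(system) & set(gold)   # flagged a genuine error
        corrected = {p for p in detected if system[p] == gold[p]}
        det_p = len(detected) / len(system) if system else 0.0
        det_r = len(detected) / len(gold) if gold else 0.0
        cor_p = len(corrected) / len(system) if system else 0.0
        cor_r = len(corrected) / len(gold) if gold else 0.0
        return det_p, det_r, cor_p, cor_r

    # Two gold errors; the system flags two positions, one of them a
    # false positive, and corrects the genuine one properly.
    gold = {3: "injections", 7: "objections"}
    system = {3: "injections", 5: "the"}
    print(score(system, gold))  # (0.5, 0.5, 0.5, 0.5)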
I would be interested in proposals for collaboration towards building
benchmark sets for more languages.
Martin Reynaert
ILK
TiCC
Tilburg University
The Netherlands