[Corpora-List] Spellchecker evaluation corpus

Roger Mitton roger at dcs.bbk.ac.uk
Wed Apr 13 18:14:18 UTC 2011


A late response to Stefan's posting of 9 Apr.

The lack of a standard corpus for evaluating English spellcheckers may not be 
all that surprising.  Researchers focus on different aspects of spellchecking, 
and a corpus appropriate for testing one piece of work may be almost useless for 
testing another. Are we concentrating on detecting errors or can we take the 
error as given and concentrate on suggesting corrections? Are we happy to ignore 
real-word errors or, on the other hand, are they the focus of our research? Do 
we want to tackle the sort of misspellings made by people who have difficulty 
with spelling or are we correcting the occasional typo in otherwise correct 
text? Are we interested, perhaps exclusively, in OCR errors? Does it matter if 
the errors are made by native speakers or second-language users of English? Is a 
set of errors (with their targets) adequate or is it essential to have context?  
If the latter, will a snippet of context do or do you need full documents?  Do 
we want to correct running text or queries to a search engine?  And so on.

My own work has focussed on trying to correct the mangled efforts of poor 
spellers. Years ago, I gathered various collections of misspellings and 
deposited them, with some documentation, in the Oxford Text Archive, who 
christened them the "Birkbeck error corpus". There is a file, derived from 
these, for download from my website, along with a couple of others:

http://www.dcs.bbk.ac.uk/~roger/corpora.html

More recently, my colleague Jenny Pedler has compiled a file specifically of 
real-word errors, in some context. This is also available for download:

http://www.dcs.bbk.ac.uk/~jenny/resources.html

Roger Mitton
Birkbeck, University of London

On Sat, Apr 9, 2011 at 10:45 AM, Stefan Bordag
<sbordag at informatik.uni-leipzig.de> wrote:

> Hi everyone,
>
> It seems like for every conceivable NLP task there is some agreed-upon
> evaluation data set, or at least one that is used in several papers. Now,
> for some strange reason I seem to be utterly unable to find any such test
> set for the spell checking task!
>
> Am I doing something wrong or is there no such data set? I know I can make
> synthetic tests by systematically inserting, swapping etc. letters in my own
> test data, but this would give me results which I cannot compare to any
> other results. Hence, is there some accepted evaluation forum which I am
> missing? Whenever I include "spell check" in any form in search queries,
> I get lots of tutorials on how to write a spellchecker and almost nothing
> else...
>
> Best regards,
> Stefan Bordag
>
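
For concreteness, the synthetic test Stefan describes (systematically
inserting, swapping etc. letters) might be sketched roughly as follows. This
is only an illustration under stated assumptions: the function names
`corrupt` and `make_test_pairs`, the lowercase a-z alphabet and the choice of
exactly one edit per word are all hypothetical, not taken from any existing
tool or corpus. As Stefan notes, results on data generated this way are not
comparable with anyone else's.

```python
import random

# Assumption: lowercase a-z is enough for this illustration.
ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def corrupt(word, rng):
    """Return `word` with one random single-character edit applied.

    The edit is an insertion, substitution, deletion or transposition of
    adjacent letters. Hypothetical helper, for illustration only.
    """
    if not word:
        return word
    ops = ["insert", "substitute"]
    if len(word) > 1:
        ops += ["delete", "transpose"]
    while True:
        op = rng.choice(ops)
        if op == "insert":
            i = rng.randrange(len(word) + 1)
            candidate = word[:i] + rng.choice(ALPHABET) + word[i:]
        elif op == "substitute":
            i = rng.randrange(len(word))
            letters = [c for c in ALPHABET if c != word[i]]
            candidate = word[:i] + rng.choice(letters) + word[i + 1:]
        elif op == "delete":
            i = rng.randrange(len(word))
            candidate = word[:i] + word[i + 1:]
        else:  # transpose two adjacent letters
            i = rng.randrange(len(word) - 1)
            candidate = word[:i] + word[i + 1] + word[i] + word[i + 2:]
        # Swapping two identical letters changes nothing, so retry.
        if candidate != word:
            return candidate


def make_test_pairs(words, seed=0):
    """Build (error, target) pairs, one synthetic misspelling per word."""
    rng = random.Random(seed)  # fixed seed so the test set is repeatable
    return [(corrupt(w, rng), w) for w in words]


if __name__ == "__main__":
    for error, target in make_test_pairs(["spelling", "corpus", "search"]):
        print(error, "->", target)
```

Each pair has the same shape as an entry in a list of errors with their
targets, but without any context, and the edits are typo-like rather than
the kind of misspelling a poor speller would produce.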

