[Corpora-List] Spellchecker evaluation corpus
Roger Mitton
roger at dcs.bbk.ac.uk
Wed Apr 13 18:14:18 UTC 2011
A late response to Stefan's posting of 9 Apr.
The lack of a standard corpus for evaluating English spellcheckers may not be
all that surprising. Researchers focus on different aspects of spellchecking,
and a corpus appropriate for testing one piece of work may be almost useless for
testing another. Are we concentrating on detecting errors or can we take the
error as given and concentrate on suggesting corrections? Are we happy to ignore
real-word errors or, on the other hand, are they the focus of our research? Do
we want to tackle the sort of misspellings made by people who have difficulty
with spelling or are we correcting the occasional typo in otherwise correct
text? Are we interested, perhaps exclusively, in OCR errors? Does it matter if
the errors are made by native speakers or second-language users of English? Is a
set of errors (with their targets) adequate or is it essential to have context?
If the latter, will a snippet of context do or do you need full documents? Do
we want to correct running text or queries to a search engine? And so on.
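To make one of these choices concrete, here is a minimal sketch of the simplest setup mentioned above: a bare list of errors with their targets, no context, and the error taken as given. It only illustrates how such a list might be scored. The function name top_n_accuracy, the tiny word list and the error/target pairs are all made up for the example, and Python's difflib.get_close_matches merely stands in for whatever spellchecker is being tested.

```python
import difflib


def top_n_accuracy(pairs, dictionary, n=3):
    """Fraction of (error, target) pairs whose target is among the top n suggestions.

    `pairs` is a list of (misspelling, intended_word) tuples.
    `dictionary` is the list of words the stand-in suggester may propose.
    """
    if not pairs:
        return 0.0
    hits = 0
    for error, target in pairs:
        # Stand-in suggester: rank dictionary words by difflib similarity.
        # cutoff=0.0 means no candidate is discarded before ranking.
        suggestions = difflib.get_close_matches(error, dictionary, n=n, cutoff=0.0)
        if target in suggestions:
            hits += 1
    return hits / len(pairs)


# Hypothetical data, invented for this example only.
dictionary = ["receive", "separate", "necessary", "definitely", "their", "there"]
pairs = [
    ("recieve", "receive"),
    ("seperate", "separate"),
    ("neccesary", "necessary"),
]

if __name__ == "__main__":
    print("top-1 accuracy:", top_n_accuracy(pairs, dictionary, n=1))
    print("top-3 accuracy:", top_n_accuracy(pairs, dictionary, n=3))
```

A score of this kind says nothing about detection, real-word errors or the use of context, which is exactly why a corpus built for it may be of little use for other work.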
My own work has focussed on trying to correct the mangled efforts of poor
spellers. Years ago, I gathered various collections of misspellings and
deposited them, with some documentation, in the Oxford Text Archive, who
christened them the "Birkbeck error corpus". There is a file, derived from
these, for download from my website, along with a couple of others:
http://www.dcs.bbk.ac.uk/~roger/corpora.html
More recently, my colleague Jenny Pedler has compiled a file specifically of
real-word errors, in some context. This is also available for download:
http://www.dcs.bbk.ac.uk/~jenny/resources.html
Roger Mitton
Birkbeck, University of London
On Sat, Apr 9, 2011 at 10:45 AM, Stefan Bordag
<sbordag at informatik.uni-leipzig.de> wrote:
> Hi everyone,
>
> It seems like for every conceivable NLP task there is some agreed-upon
> evaluation data set. Or at least one that is used in at least several
> papers. Now, for some strange reason I seem to be utterly unable to find any
> such test set for the spell checking task!
>
> Am I doing something wrong or is there no such data set? I know I can make
> synthetic tests by systematically inserting, swapping, etc. letters in my own
> test data, but this would give me results which I cannot compare to any
> other results. Hence, is there some accepted evaluation forum which I am
> missing? Whenever I include spell check in any form in search queries,
> I get lots of tutorials on how to write a spellchecker and almost nothing
> else...
>
> Best regards,
> Stefan Bordag
>
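For illustration, the synthetic test Stefan describes (systematically inserting, swapping, etc. letters) could be sketched as below. This is an assumed, minimal version: the function name corrupt, the four edit operations chosen and the sample words are made up for the example, and single random edits of this kind are not claimed to resemble the errors real writers make.

```python
import random
import string


def corrupt(word, rng):
    """Return `word` with one random single-character edit applied.

    The edit is an insertion, deletion, substitution or transposition of
    adjacent letters. Words shorter than two letters are returned unchanged.
    Note that a substitution or transposition can occasionally reproduce the
    original word (e.g. swapping two identical letters).
    """
    if len(word) < 2:
        return word
    op = rng.choice(["insert", "delete", "substitute", "transpose"])
    i = rng.randrange(len(word))
    if op == "insert":
        return word[:i] + rng.choice(string.ascii_lowercase) + word[i:]
    if op == "delete":
        return word[:i] + word[i + 1:]
    if op == "substitute":
        return word[:i] + rng.choice(string.ascii_lowercase) + word[i + 1:]
    # Transpose: swap the letter at position j with its right-hand neighbour.
    j = min(i, len(word) - 2)
    return word[:j] + word[j + 1] + word[j] + word[j + 2:]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the test set is reproducible
    words = ["spelling", "corpus", "evaluation", "checker"]
    for target in words:
        print(corrupt(target, rng), "->", target)
```

Fixing the seed makes the generated set repeatable, but, as Stefan notes, results on such data still cannot be compared with results obtained on anyone else's test set.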