[Corpora-List] Spellchecker evaluation corpus

Stefan Bordag sbordag at informatik.uni-leipzig.de
Thu Apr 14 06:42:15 UTC 2011


Hi Roger,

Thanks for this valuable input.

I imagine, however, that it wouldn't be conceptually difficult to set up 
a test that covers most or all of the needs you mentioned. A proper 
evaluation setup for spellchecking in general would consist of:
- several collections of misspelled words, each with a defined amount 
of context and in different languages, to evaluate spelling error 
detectors and correctors
- collections differing by source of error: errors made by dyslexic 
writers are different from errors introduced by OCR, which are in turn 
different from errors introduced by writing SMS, which are different 
from errors introduced by beginners trying to write scientific papers, 
etc.
- several collections of string pairs (wrong to correct) in several 
different languages to evaluate context-free spelling correction 
algorithms (though the previous collections could be used for that as 
well; see the sketch after this list)
- it should distinguish between spell checkers that need training data 
to learn to properly detect or correct errors and those that don't need 
any explicit training data (such as the Lucene spell checker)
- it should also take resource usage into account: the Lucene spell 
checker uses considerably more memory than a simple edit distance 
searcher, which may, on the other hand, use more CPU time.
- the languages covered should come from different language families 
and should include languages with non-concatenative morphology as well 
as languages such as Chinese.
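
To make the string-pair part of this concrete, here is a rough Python 
sketch of such a context-free evaluation. The toy lexicon, the toy 
pairs and the naive edit-distance corrector are purely illustrative 
assumptions, not an existing test collection or tool:

    # Minimal sketch: score a context-free corrector on (wrong, correct) pairs.

    def edit_distance(a, b):
        """Plain Levenshtein distance via dynamic programming."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                  # deletion
                               cur[j - 1] + 1,               # insertion
                               prev[j - 1] + (ca != cb)))    # substitution
            prev = cur
        return prev[-1]

    def naive_corrector(word, lexicon):
        """Return the lexicon entry closest to `word` by edit distance."""
        return min(lexicon, key=lambda w: edit_distance(word, w))

    def evaluate(pairs, correct_fn):
        """Fraction of misspellings whose first suggestion is the target."""
        hits = sum(1 for wrong, right in pairs if correct_fn(wrong) == right)
        return hits / len(pairs)

    if __name__ == "__main__":
        # Toy data standing in for one of the proposed collections.
        lexicon = {"separate", "definitely", "receive", "spelling"}
        pairs = [("seperate", "separate"),
                 ("definately", "definitely"),
                 ("recieve", "receive")]
        accuracy = evaluate(pairs, lambda w: naive_corrector(w, lexicon))
        print("first-guess accuracy: %.2f" % accuracy)

The same harness could wrap a trained corrector or the Lucene spell 
checker instead of the naive baseline, which is exactly where the 
training-data and resource-usage distinctions above would show up.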

I would wager that once such a well-rounded collection covering these 
different aspects has been compiled, it would very well be possible to 
make definitive statements about which algorithm covers which areas to 
what extent.

Additionally, with today's abundance of internet bandwidth and CPU 
resources, it shouldn't be difficult to set up an evaluation webservice 
that allows the author of a new algorithm to test it against the 
service. That way the party running the evaluation wouldn't even have 
to make the data freely available as such. Not all of it, anyway. This 
would be quite similar to the Microsoft spell checker competition that 
has been mentioned here, but without the legal terms that make them the 
owner of your algorithm once you want to participate...
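
To illustrate what I have in mind, the participant's side of such a 
service might look roughly like this in Python. The URL, the JSON 
fields and the returned scores are entirely hypothetical; no such 
service exists yet:

    # Sketch: submit a system's corrections and receive only aggregate scores,
    # so the gold-standard side of the test data never leaves the server.
    import requests

    def submit_corrections(corrections, run_id="my-new-algorithm"):
        """Send corrections for the hidden test set; get back scores only."""
        payload = {"run": run_id, "corrections": corrections}
        response = requests.post("https://example.org/spell-eval/score",
                                 json=payload, timeout=30)
        response.raise_for_status()
        # e.g. {"detection_f1": 0.81, "correction_accuracy": 0.74}
        return response.json()

    # The service would only ever hand out the misspelled side of the data:
    # items = requests.get("https://example.org/spell-eval/test-items").json()
    # scores = submit_corrections({i: my_corrector(w) for i, w in items})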

A very similar approach taken by the Morpho Challenge [1] has helped to 
show (among many other things) that some algorithms, while producing 
excellent results for English, may fail badly for Turkish, for example.

Best regards,
Stefan

[1] http://research.ics.tkk.fi/events/morphochallenge2010/

On 13.04.2011 20:14, Roger Mitton wrote:
> A late response to Stefan's posting of 9 Apr.
>
> The lack of a standard corpus for evaluating English spellcheckers may not be
> all that surprising.  Researchers focus on different aspects of spellchecking,
> and a corpus appropriate for testing one piece of work may be almost useless for
> testing another. Are we concentrating on detecting errors or can we take the
> error as given and concentrate on suggesting corrections? Are we happy to ignore
> real-word errors or, on the other hand, are they the focus of our research? Do
> we want to tackle the sort of misspellings made by people who have difficulty
> with spelling or are we correcting the occasional typo in otherwise correct
> text? Are we interested, perhaps exclusively, in OCR errors? Does it matter if
> the errors are made by native speakers or second-language users of English? Is a
> set of errors (with their targets) adequate or is it essential to have context?
> If the latter, will a snippet of context do or do you need full documents?  Do
> we want to correct running text or queries to a search engine?  And so on.
>
> My own work has focussed on trying to correct the mangled efforts of poor
> spellers. Years ago, I gathered various collections of misspellings and
> deposited them, with some documentation, in the Oxford Text Archive, who
> christened them the "Birkbeck error corpus". There is a file, derived from
> these, for download from my website, along with a couple of others:
>
> http://www.dcs.bbk.ac.uk/~roger/corpora.html
>
> More recently, my colleague Jenny Pedler has compiled a file specifically of
> real-word errors, in some context. This is also available for download:
>
> http://www.dcs.bbk.ac.uk/~jenny/resources.html
>
> Roger Mitton
> Birkbeck, University of London
>
> On Sat, Apr 9, 2011 at 10:45 AM, Stefan Bordag
> <sbordag at informatik.uni-leipzig.de>  wrote:
>
>

-- 
-------------------------------------------
- Dr. Stefan Bordag                       -
- 0341 49 26 196                          -
- sbordag at informatik.uni-leipzig.de       -
-------------------------------------------

