[Corpora-List] Spellchecker evaluation corpus
Stefan Bordag
sbordag at informatik.uni-leipzig.de
Thu Apr 14 06:42:15 UTC 2011
Hi Roger,
Thanks for this valuable input.
I imagine, however, that it wouldn't be conceptually difficult to set up
a test that covers most or all of the needs you mention. A proper
evaluation setup for spellchecking in general would consist of:
- several collections of misspelled words, each with a defined context
size, in several languages, to evaluate spelling error detectors and
correctors
- collections differing by source of error: errors made by dyslexics
are different from errors introduced by OCR, which are in turn different
from errors introduced by writing SMS, which are different from errors
introduced by beginners trying to write scientific papers, etc.
- several collections of string pairs (wrong to correct) in several
different languages to evaluate context-free spelling correction
algorithms (though the previous collections could be used for that as well)
- it should distinguish between spell checkers that need training data
to learn to properly detect or correct errors and those that don't need
any explicit training data (such as the Lucene spell checker)
- it should also take resource usage into account: the Lucene spell
checker is much more memory-intensive than a simple edit distance
searcher, which in turn might use more CPU time.
- the languages covered should come from different language families,
and should include languages with non-concatenative morphology as well
as languages such as Chinese.
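For the context-free case, a collection of (wrong, correct) string pairs is already enough to score a baseline. A minimal sketch in Python (my own illustration, not any existing tool) of an edit-distance corrector and its evaluation over such pairs:

```python
def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance via dynamic programming (two rows)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def correct(word: str, lexicon: list[str]) -> str:
    """Context-free correction: nearest lexicon entry by edit distance."""
    return min(lexicon, key=lambda w: edit_distance(word, w))

def accuracy(pairs: list[tuple[str, str]], lexicon: list[str]) -> float:
    """Fraction of misspellings whose top suggestion equals the target."""
    hits = sum(correct(wrong, right_lex) == right
               for (wrong, right), right_lex in ((p, lexicon) for p in pairs))
    return hits / len(pairs)
```

Any of the collections above could be fed through `accuracy` unchanged; only the corrector under test would be swapped out.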
I would wager that once such a rounded collection covering these
different aspects has been assembled, it would very well be possible to
make definitive statements about which algorithm covers which areas to
what extent.
Additionally, with today's abundance of internet bandwidth and CPU
resources, it shouldn't be difficult to set up an evaluation webservice
that allows the author of a new algorithm to test it against the
service. That way the evaluating party wouldn't even have to make the
data freely available as such, or at least not all of it. Quite similar
to the Microsoft spell checker competition that has been mentioned
here, but without the legal terms that make them the owner of your
algorithm once you want to participate...
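The core of such a webservice could be nothing more than a scoring routine that keeps the gold pairs private and returns only aggregate numbers. A hypothetical sketch (the data, names, and response format here are my own invention, not an existing service):

```python
# Server-side gold data: (wrong -> correct) pairs that are never released.
HIDDEN_GOLD = {
    "speling": "spelling",
    "cheker": "checker",
    "langage": "language",
}

def score_submission(corrections: dict[str, str]) -> dict:
    """Score a participant's corrections against the private gold pairs.

    Only aggregate counts come back, so participants can evaluate their
    algorithm without ever seeing the test data itself.
    """
    evaluated = [w for w in HIDDEN_GOLD if w in corrections]
    hits = sum(corrections[w] == HIDDEN_GOLD[w] for w in evaluated)
    return {
        "evaluated": len(evaluated),
        "correct": hits,
        "accuracy": hits / len(evaluated) if evaluated else 0.0,
    }
```

Wrapping this in an HTTP endpoint would then be routine; the essential point is that only the score leaves the server.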
A very similar approach, taken by the Morpho Challenge [1], has helped
to show (among many other things) that some algorithms that produce
excellent results in English can fail badly in Turkish, for example.
Best regards,
Stefan
[1] http://research.ics.tkk.fi/events/morphochallenge2010/
On 13.04.2011 20:14, Roger Mitton wrote:
> A late response to Stefan's posting of 9 Apr.
>
> The lack of a standard corpus for evaluating English spellcheckers may not be
> all that surprising. Researchers focus on different aspects of spellchecking,
> and a corpus appropriate for testing one piece of work may be almost useless for
> testing another. Are we concentrating on detecting errors or can we take the
> error as given and concentrate on suggesting corrections? Are we happy to ignore
> real-word errors or, on the other hand, are they the focus of our research? Do
> we want to tackle the sort of misspellings made by people who have difficulty
> with spelling or are we correcting the occasional typo in otherwise correct
> text? Are we interested, perhaps exclusively, in OCR errors? Does it matter if
> the errors are made by native speakers or second-language users of English? Is a
> set of errors (with their targets) adequate or is it essential to have context?
> If the latter, will a snippet of context do or do you need full documents? Do
> we want to correct running text or queries to a search engine? And so on.
>
> My own work has focussed on trying to correct the mangled efforts of poor
> spellers. Years ago, I gathered various collections of misspellings and
> deposited them, with some documentation, in the Oxford Text Archive, who
> christened them the "Birkbeck error corpus". There is a file, derived from
> these, for download from my website, along with a couple of others:
>
> http://www.dcs.bbk.ac.uk/~roger/corpora.html
>
> More recently, my colleague Jenny Pedler has compiled a file specifically of
> real-word errors, in some context. This is also available for download:
>
> http://www.dcs.bbk.ac.uk/~jenny/resources.html
>
> Roger Mitton
> Birkbeck, University of London
>
> On Sat, Apr 9, 2011 at 10:45 AM, Stefan Bordag
> <sbordag at informatik.uni-leipzig.de> wrote:
--
-------------------------------------------
- Dr. Stefan Bordag -
- 0341 49 26 196 -
- sbordag at informatik.uni-leipzig.de -
-------------------------------------------