[Corpora-List] Spellchecker evaluation corpus

Stefan Bordag sbordag at informatik.uni-leipzig.de
Thu Apr 14 08:48:31 UTC 2011


Dear Antal,

You are right, I didn't think it through to its final consequence. But
once you put it like this, perhaps producing such a corpus wouldn't be
so difficult after all. Perhaps all it takes is a custom plugin for
OpenOffice that people can use when they review documents they have
written in OO for errors. By clicking an accept button provided by the
plugin, they would consent to having both the original and the revised
version sent to a database known to the plugin. Over time, this could
produce a sizeable collection of all sorts of corrections in all sorts
of languages. I certainly would have no problem sending both the
uncorrected and corrected versions of my papers to such a database.
After all, it would be one more place where they'd be sort of
published. :)
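
As a rough illustration of what such a database could derive from each
submitted pair, here is a minimal Python sketch that aligns the original
and the revised version token by token and keeps only the spans that
differ. The function name `extract_corrections`, the whitespace
tokenization, and the sample sentences are assumptions made for
illustration; they are not part of any existing plugin.

```python
import difflib


def extract_corrections(original: str, revised: str):
    """Return (original_span, revised_span) pairs where the versions differ.

    Hypothetical helper: tokens are split on whitespace, which is a
    simplifying assumption and not a real tokenizer.
    """
    orig_tokens = original.split()
    rev_tokens = revised.split()
    matcher = difflib.SequenceMatcher(a=orig_tokens, b=rev_tokens, autojunk=False)
    corrections = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":  # 'replace', 'delete' or 'insert'
            corrections.append(
                (" ".join(orig_tokens[i1:i2]), " ".join(rev_tokens[j1:j2]))
            )
    return corrections


# Made-up example pair (uncorrected vs. corrected sentence).
original = "Simply by klicking the accept button they consent ."
revised = "Simply by clicking the accept button they consent ."
pairs = extract_corrections(original, revised)
print(pairs)
```

Storing the full pair of texts rather than only these extracted spans
would keep the surrounding context, including all the words that were
left unchanged.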

Best regards,
Stefan

On 14.04.2011 10:40, A.P.J. van den Bosch wrote:
> Dear Stefan,
>
> All good points, but when you say
>
>> - several collections of misspelled words along with a defined context size of differing languages to evaluate spelling error detectors and correctors
> what do you mean by a defined context size? What seems to be missing from your list is what I think should be the ultimate evaluation setting: _full_ texts with _all_ errors annotated.
>
> Error list evaluations cannot measure the false alarm rate or precision of your spelling error detector: how often does it think it has found an error which isn't one? Put another way, an algorithm with great recall/accuracy on an error list may actually be an over-enthusiastic system that flags many normal words as errors as well.
>
> For fully automatic correction and corpus cleanup this is quite vital: does your method do more harm than good? But interactive spellcheckers could also do with higher precision; the spellchecker is one of the most widely used pieces of language technology worldwide, yet it is not particularly loved, because of its low precision.
>
>    Antal
>
> --
> Antal van den Bosch Antal.vdnBosch at uvt.nl http://ilk.uvt.nl/~antalb/
> ILK / Tilburg center for Cognition and Communication, Tilburg University
>
>
>
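
The distinction drawn in the quoted message can be made concrete with a
small Python sketch. It assumes a fully annotated text in which the
positions of all real errors are known, and compares them with the
positions a detector flags. The function name `evaluate_detector` and the
toy numbers are hypothetical; the false alarm rate is taken here as the
share of correct tokens that get flagged, which is one common
definition.

```python
def evaluate_detector(num_tokens, gold_errors, flagged):
    """Score a spelling error detector on a fully annotated text.

    num_tokens:  total number of tokens in the text
    gold_errors: set of token positions annotated as real errors
    flagged:     set of token positions the detector marked as errors
    """
    true_pos = len(gold_errors & flagged)
    false_pos = len(flagged - gold_errors)   # false alarms
    false_neg = len(gold_errors - flagged)   # missed errors
    correct_tokens = num_tokens - len(gold_errors)

    precision = true_pos / (true_pos + false_pos) if flagged else 0.0
    recall = true_pos / (true_pos + false_neg) if gold_errors else 0.0
    false_alarm_rate = false_pos / correct_tokens if correct_tokens else 0.0
    return precision, recall, false_alarm_rate


# Toy data: 10 tokens, 2 real errors, and an over-eager detector that
# flags both errors plus 4 correct words.
precision, recall, false_alarm_rate = evaluate_detector(
    num_tokens=10, gold_errors={2, 7}, flagged={0, 2, 4, 5, 7, 9}
)
print(precision, recall, false_alarm_rate)
```

On an error list containing only the two misspelled words, this detector
would look perfect, since recall is all that can be measured there. The
four false alarms only become visible on the full text.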


-- 
-------------------------------------------
- Dr. Stefan Bordag                       -
- 0341 49 26 196                          -
- sbordag at informatik.uni-leipzig.de       -
-------------------------------------------

