[Corpora-List] Spellchecker evaluation corpus

Sat Apr 9 11:47:40 UTC 2011

On Sat, 9 Apr 2011, Stefan Bordag <sbordag at informatik.uni-leipzig.de> wrote:

> It seems like for every conceivable NLP task there is some agreed-upon
> evaluation data set. Or at least one that is used in at least several
> papers. Now, for some strange reason I seem to be utterly unable to find
> any such test set for the spell checking task!
>
> Am I doing something wrong or is there no such data set? ...

I don't know of one per se, other than my own dyslexically-motivated
scribblings, though I am heartened to see that others have responded with
a few datasets. But which language are you considering, and, equally,
which period? The orthography of English has changed over the years:
celebrations are being organised this year for the 400th anniversary of
the King James Version of the Bible, and the spelling rules of that time
were different from today's. Similarly, the rules of 200 years ago, in
Jane Austen's time, were different again. English literature students
should be familiar with both of these examples.

> ... I know I can make synthetic tests systematically inserting, swapping
> etc. letters in my own test data, but this would give me results which I
> cannot compare to any other results. ...

There are some Perl/Python/Ruby scripts around that do those types of
transpositions. The frequency of such alterations might well be listed in
the textual criticism literature, that is, which errors are actually
observed in real usage.
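
As a rough illustration of the kind of script meant here, a minimal Python
sketch follows. The function names (corrupt, make_pairs) and the uniform
choice among the four edit types are assumptions made for the example, not
taken from any existing tool; realistic error frequencies would have to
come from observed data.

```python
import random
import string


def _edit_once(word, rng, alphabet):
    """Apply one single-character edit (may occasionally return word unchanged)."""
    ops = ["insert", "substitute"]
    if len(word) >= 2:
        # Deleting from a one-letter word would leave nothing to check,
        # and transposing needs two adjacent letters.
        ops += ["delete", "transpose"]
    op = rng.choice(ops)
    if op == "insert":
        i = rng.randrange(len(word) + 1)
        return word[:i] + rng.choice(alphabet) + word[i:]
    if op == "transpose":
        i = rng.randrange(len(word) - 1)
        return word[:i] + word[i + 1] + word[i] + word[i + 2:]
    i = rng.randrange(len(word))
    if op == "delete":
        return word[:i] + word[i + 1:]
    # substitute: replace the letter at position i with a different one
    others = [c for c in alphabet if c != word[i]]
    return word[:i] + rng.choice(others) + word[i + 1:]


def corrupt(word, rng, alphabet=string.ascii_lowercase):
    """Return a misspelling of word that differs from it by exactly one edit."""
    if not word:
        return rng.choice(alphabet)
    while True:
        candidate = _edit_once(word, rng, alphabet)
        # Swapping two identical letters ("tt" in "letter") changes nothing,
        # so retry until the result really differs.
        if candidate != word:
            return candidate


def make_pairs(words, error_rate, seed=0):
    """Return (observed, intended) pairs, corrupting roughly error_rate of the words."""
    rng = random.Random(seed)  # fixed seed so the test set is reproducible
    pairs = []
    for w in words:
        observed = corrupt(w, rng) if rng.random() < error_rate else w
        pairs.append((observed, w))
    return pairs


if __name__ == "__main__":
    for observed, intended in make_pairs("the quick brown fox".split(), 0.5):
        print(observed, intended)
```

Fixing the seed at least makes such a synthetic set repeatable, so the
same pairs could be shared with others, though it does not make the errors
any more realistic.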

Regards, Trevor

<>< Re: deemed!
