[Corpora-List] Spellchecker evaluation corpus

Yannick Versley versley at sfs.uni-tuebingen.de
Sat Apr 9 09:41:24 UTC 2011


Stefan,

The TüBa-D/Z treebank keeps the original spelling of the regular tokens and
annotates spelling corrections in the comment field. This means that it can be
used to train and test spell checkers (with a suitable split), and that the
distribution of errors matches what actually occurs in edited newspaper text.
(It is less typical of the careless writing that you'll find in
user-contributed web content, though.)
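
To make the train/test idea concrete, here is a minimal sketch in Python. It
assumes you have already extracted (observed token, correction) pairs from the
treebank yourself; the pair format, the function names split_pairs and
evaluate, and the checker interface are all made up for illustration and are
not part of TüBa-D/Z or any of its tools. The random split is only a
placeholder; splitting by article is probably closer to a "suitable split".

```python
import random
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical pair format: (observed token, gold correction),
# with None as the correction when the token is spelled correctly.
Pair = Tuple[str, Optional[str]]


def split_pairs(pairs: List[Pair], test_fraction: float = 0.2,
                seed: int = 0) -> Tuple[List[Pair], List[Pair]]:
    """Shuffle the pairs reproducibly and cut them into train and test."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]


def evaluate(checker: Callable[[str], Optional[str]],
             test_pairs: List[Pair]) -> Dict[str, float]:
    """Score a checker that returns a correction, or None to keep the token."""
    tp = fp = fn = 0
    for observed, gold in test_pairs:
        proposed = checker(observed)
        if proposed == observed:
            proposed = None  # returning the same token counts as "no change"
        if gold is not None and proposed == gold:
            tp += 1  # real error, corrected properly
        else:
            if proposed is not None:
                fp += 1  # changed a correct token, or miscorrected an error
            if gold is not None:
                fn += 1  # real error that was missed or miscorrected
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {"precision": precision, "recall": recall}
```

A checker here is any function from a token to a proposed correction (or
None), so a dictionary lookup, an edit-distance model, or a wrapper around an
existing tool can all be scored in the same way on the held-out pairs.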

Best,
Yannick


On Sat, Apr 9, 2011 at 10:45 AM, Stefan Bordag <sbordag at informatik.uni-leipzig.de> wrote:

> Hi everyone,
>
> It seems like for every conceivable NLP task there is some agreed-upon
> evaluation data set. Or at least one that is used in at least several
> papers. Now, for some strange reason I seem to be utterly unable to find any
> such test set for the spell checking task!
>
> Am I doing something wrong, or is there no such data set? I know I can build
> synthetic tests by systematically inserting, swapping, etc. letters in my own
> test data, but that would give me results that I cannot compare to any
> other results. Hence, is there some accepted evaluation forum that I am
> missing? Whenever I include "spell check" in any form in search queries,
> I get lots of tutorials on how to write a spellchecker and almost nothing
> else...
>
> Best regards,
> Stefan Bordag
>
> _______________________________________________
> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
>

