[Corpora-List] Summary: Studies of spelling error frequency in journalistic text
Raphael Mudge
raffi at automattic.com
Thu Dec 3 15:14:23 UTC 2009
Hi Nicola,
Roger Mitton has done a lot of work in the area of computer spell
checking: http://www.dcs.bbk.ac.uk/~roger/ His university also makes
available a corpus of misspelled words that may be useful to you. One
of his students also did a thesis on real-word errors. These resources
are English only, though.
One of my favorite essays on spelling correction is: http://norvig.com/spell-correct.html
For spelling correction, I've created an open source system called
After the Deadline. It includes a call that will generate statistics
about the writing quality of a text you give to it
[see: http://www.afterthedeadline.com/api.slp ]. It also does some
real-word error detection, but this is based on trigrams and fixed
confusion sets. You can look at it at http://open.afterthedeadline.com
Again, this system is English only at this time.
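To give a rough idea of what "trigrams and fixed confusion sets" means in practice, here is a minimal sketch in Python. It is not the After the Deadline code: the confusion sets, the function names, and the toy training sentences are all made up for illustration, and a real system would use smoothed probabilities from a large corpus rather than raw counts.

```python
from collections import Counter

# Hypothetical fixed confusion sets (assumption); a real list is much larger.
CONFUSION_SETS = [{"their", "there"}, {"to", "too", "two"}, {"its", "it's"}]


def trigram_counts(sentences):
    """Count word trigrams, padding each sentence with boundary markers."""
    counts = Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        for i in range(len(tokens) - 2):
            counts[tuple(tokens[i:i + 3])] += 1
    return counts


def flag_real_word_errors(sentence, counts):
    """Return (word_index, written_word, suggested_word) for each word whose
    confusion-set alternative fits the surrounding trigram strictly better."""
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    suggestions = []
    for i in range(1, len(tokens) - 1):
        word = tokens[i]
        for conf_set in CONFUSION_SETS:
            if word not in conf_set:
                continue

            def score(candidate):
                # Raw count of (previous word, candidate, next word).
                return counts[(tokens[i - 1], candidate, tokens[i + 1])]

            best = max(sorted(conf_set), key=score)
            if best != word and score(best) > score(word):
                suggestions.append((i - 1, word, best))
    return suggestions


# Toy training data (assumption), standing in for a real corpus.
counts = trigram_counts([
    "they went to their house",
    "there is a house over there",
    "she left their house",
])
print(flag_real_word_errors("they went to there house", counts))
```

Only words that belong to a confusion set are ever checked, which is why this approach cannot catch real-word errors outside the fixed list.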
I also make available a package of English bootstrap data that includes
text from public domain books and Wikipedia infused with spelling and
grammar errors taken from Wikipedia's list of commonly misspelled words.
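If you want to build similar noisy data yourself, the idea of infusing clean text with known misspellings can be sketched as below. This is only an illustration under my own assumptions: the three-entry misspelling table, the `inject_errors` function, and its `rate` and `seed` parameters are hypothetical and do not describe the format or tooling of the actual package.

```python
import random

# Hypothetical excerpt of a "correct word -> known misspellings" table
# (assumption); a real list would be loaded from a file.
MISSPELLINGS = {
    "separate": ["seperate"],
    "definitely": ["definately"],
    "receive": ["recieve"],
}


def inject_errors(text, rate=0.5, seed=0):
    """Replace known words with a listed misspelling with probability `rate`.

    Returns the corrupted text and a list of (misspelling, original) pairs,
    which can serve as labels for training or evaluation.
    """
    rng = random.Random(seed)  # seeded so the corruption is reproducible
    out, labels = [], []
    for token in text.split():
        variants = MISSPELLINGS.get(token.lower())
        if variants and rng.random() < rate:
            out.append(rng.choice(variants))
            labels.append((out[-1], token))
        else:
            out.append(token)
    return " ".join(out), labels


noisy, labels = inject_errors("please receive the separate parcel", rate=1.0)
print(noisy)
print(labels)
```

Keeping the (misspelling, original) pairs alongside the corrupted text means the same data can be used both to train a corrector and to score it.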
For noisy texts, I recommend googling for a "Learner Corpus".
Best of luck.
-- Raphael
On Dec 3, 2009, at 9:41 AM, Nicola Bertoldi wrote:
> I am going to do some investigation to improve machine translation
> when it is applied to texts corrupted by misspellings of any sort
> (non-word, real-word errors).
>
> In this preliminary phase I am collecting information about the
> spelling correction task
> and other applications and tasks that involve spelling correction.
>
> In particular, I am interested in
> - surveys about the task
> - statistics about the most common misspellings in texts of
> different languages and different genres
> - publicly available software for spelling correction
> - available corpora of noisy texts
> - any further resources that may be useful for my topic
>
>
>
> Thanks!
>
> Nicola
>
> _______________________________________________
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora