[Corpora-List] Summary: Studies of spelling error frequency in journalistic text

Raphael Mudge raffi at automattic.com
Thu Dec 3 15:14:23 UTC 2009


Hi Nicola,

Roger Mitton has done a lot of work in the area of computer spell
checking: http://www.dcs.bbk.ac.uk/~roger/ His university also makes
available a corpus of misspelled words that may be useful to you. One
of his students also did a thesis on real-word errors. This is English
only, though.

One of my favorite essays on spelling correction is: http://norvig.com/spell-correct.html

For spelling correction, I've created an open source system called
After the Deadline. It includes a call that will generate statistics
about the writing quality of a text you give to it [see:
http://www.afterthedeadline.com/api.slp]. It also does some real-word
error detection, but this is based on trigrams and fixed confusion sets.
You can look at it at http://open.afterthedeadline.com  Again, this
system is English only at this time.
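
To make the trigram and confusion-set idea concrete, here is a minimal
Python sketch. It is not the After the Deadline implementation; the
confusion sets, the function names (train_trigrams, score, suggest) and
the scoring rule (summing raw trigram counts around the word) are all
assumptions chosen for illustration.

```python
from collections import Counter

# Hypothetical confusion sets; a real system would ship a much longer fixed list.
CONFUSION_SETS = [
    {"their", "there"},
    {"to", "too", "two"},
    {"affect", "effect"},
]


def pad(sentence):
    """Lowercase, split on whitespace, and add two boundary markers per side."""
    return ["<s>", "<s>"] + sentence.lower().split() + ["</s>", "</s>"]


def train_trigrams(sentences):
    """Count every trigram in a list of (assumed correct) sentences."""
    counts = Counter()
    for sentence in sentences:
        tokens = pad(sentence)
        for i in range(len(tokens) - 2):
            counts[tuple(tokens[i:i + 3])] += 1
    return counts


def score(tokens, i, candidate, counts):
    """Sum the counts of the three trigrams covering position i
    when `candidate` is substituted there."""
    trial = tokens[:i] + [candidate] + tokens[i + 1:]
    return sum(counts[tuple(trial[s:s + 3])] for s in (i - 2, i - 1, i))


def suggest(sentence, counts):
    """Return (word_index, found, suggested) for each word whose
    confusion-set alternative scores strictly higher in context."""
    tokens = pad(sentence)
    suggestions = []
    for i in range(2, len(tokens) - 2):
        word = tokens[i]
        for cset in CONFUSION_SETS:
            if word not in cset:
                continue
            # sorted() makes tie-breaking deterministic.
            best = max(sorted(cset), key=lambda c: score(tokens, i, c, counts))
            if best != word and score(tokens, i, best, counts) > score(tokens, i, word, counts):
                suggestions.append((i - 2, word, best))
    return suggestions


corpus = [
    "they went to their house",
    "their house is over there",
    "put the book over there",
]
counts = train_trigrams(corpus)
flagged = suggest("they went to there house", counts)
```

With this toy corpus, "there" in "they went to there house" is flagged
in favour of "their". A practical system would presumably need smoothed
probabilities and a much larger corpus; raw counts are used here only
to keep the sketch short.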

I also make available a package of English bootstrap data that includes
text from public domain books and Wikipedia infused with spelling and
grammar errors taken from Wikipedia's list of commonly misspelled words.
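
As a rough illustration of how clean text can be infused with known
misspellings, here is a small Python sketch. It is only a guess at the
general approach, not the procedure used to build the package: the
MISSPELLINGS table is a hypothetical three-word excerpt, and
inject_errors and its parameters are names made up for this example.

```python
import random

# Hypothetical excerpt: correct word -> common misspellings.
MISSPELLINGS = {
    "receive": ["recieve"],
    "separate": ["seperate"],
    "definitely": ["definately", "definitly"],
}


def inject_errors(text, rate, rng):
    """Replace listed words with a misspelling with probability `rate`.

    Returns the noisy text plus (position, misspelling, original) labels,
    so the result can serve as training or evaluation data.
    """
    noisy = []
    labels = []
    for pos, word in enumerate(text.split()):
        variants = MISSPELLINGS.get(word.lower())
        if variants and rng.random() < rate:
            wrong = rng.choice(variants)
            noisy.append(wrong)
            labels.append((pos, wrong, word))
        else:
            noisy.append(word)
    return " ".join(noisy), labels


# A seeded generator keeps the output reproducible.
noisy_text, labels = inject_errors(
    "please receive the separate parcel", 1.0, random.Random(0)
)
```

Keeping the labels alongside the noisy text means each injected error
can later be checked against what a spelling corrector proposes.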

For noisy texts, I recommend googling for a "Learner Corpus".

Best of luck.

-- Raphael

On Dec 3, 2009, at 9:41 AM, Nicola Bertoldi wrote:

> I am going to do some investigation to improve machine translation
> when it is applied to texts corrupted by misspellings of any sort  
> (non-word, real-word errors).
>
> In this preliminary phase I am collecting information about the  
> spelling correction task
> and other applications and tasks that involve spelling correction.
>
> In particular, I am interested in
> - surveys about the task
> - statistics about the most common misspellings in texts of  
> different languages and different genres
> - publicly available software for spelling correction
> - available corpora of noisy texts
> - any further resources that may be useful for my topic
>
>
>
> Thanks!
>
> Nicola
>
> _______________________________________________
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
