[Corpora-List] Machine Translation and Spelling Correction

Marcin Miłkowski list-address at wp.pl
Thu Dec 3 16:16:02 UTC 2009


Hi Nicola,
Nicola Bertoldi wrote:
> I am resending this message with a more appropriate subject line.
> Sorry for the inconvenience.
>
>
>
> I am going to investigate how to improve machine translation
> when it is applied to texts corrupted by misspellings of any sort (non-word, real-word errors).
>
> In this preliminary phase I am collecting information about the spelling correction task
> and about other applications and tasks that involve spelling correction.
>
> In particular, I am interested in
> - surveys of the task
> - statistics about the most common misspellings in texts of different languages and genres
> - publicly available software for spelling correction
> - available corpora of noisy texts
> - any further resources that may be useful for my topic
>   
I'm afraid the available data are quite scarce for many languages. Look
at Roger Mitton's website (http://www.dcs.bbk.ac.uk/~roger/): he has
several corpora for English and a PDF of his book on spell-checking.

For spell-checking, you can use ispell (a bit outdated), aspell
(modern), or hunspell (good for languages with complex compounding).
The autocorrect lists for OpenOffice.org are also worth a look, and
Wikipedias usually have a "frequent typos" page as well.

You might want to create your own corpora of spelling mistakes.
This is actually quite easy if you have several GB of free space and
a couple of days to process the revision history of Wikipedia. See my paper:
http://marcinmilkowski.pl/downloads/error_corpora.pdf  - I have some
hacky scripts, but they are just a prototype written in AWK.
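
As a rough illustration of the general idea (this is not the AWK
prototype and not necessarily the method described in the paper), the
Python sketch below compares two revisions of the same text and keeps
word replacements that look like spelling fixes. The function name, the
tokenizer and the 0.75 similarity threshold are assumptions chosen for
the example.

```python
import difflib
import re

TOKEN = re.compile(r"\w+")


def candidate_corrections(old_text, new_text, min_ratio=0.75):
    """Return (old word, new word) pairs that look like spelling corrections."""
    old_tokens = TOKEN.findall(old_text)
    new_tokens = TOKEN.findall(new_text)
    matcher = difflib.SequenceMatcher(a=old_tokens, b=new_tokens,
                                      autojunk=False)
    pairs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # Only word-for-word replacements can be aligned reliably.
        if tag != "replace" or (i2 - i1) != (j2 - j1):
            continue
        for old, new in zip(old_tokens[i1:i2], new_tokens[j1:j2]):
            if not (old.isalpha() and new.isalpha()):
                continue
            if old.lower() == new.lower():
                continue  # a pure capitalization change, not a misspelling
            # Similar strings suggest a typo fix; dissimilar ones suggest
            # that the content itself was edited.
            similarity = difflib.SequenceMatcher(a=old, b=new).ratio()
            if similarity >= min_ratio:
                pairs.append((old, new))
    return pairs


if __name__ == "__main__":
    before = "The goverment recieved the report yesterday"
    after = "The government received the report yesterday"
    print(candidate_corrections(before, after))
```

Run over successive revisions of each article, this would yield raw
candidate pairs that still need filtering, for example against a
dictionary, since vandalism and rewording also produce similar-looking
replacements.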

Regards,
Marcin
