[Corpora-List] Machine Translation and Spelling Correction
Marcin Miłkowski
list-address at wp.pl
Thu Dec 3 16:16:02 UTC 2009
Hi Nicola,
Nicola Bertoldi wrote:
> I am resending this message with a more appropriate heading.
> Sorry for the inconvenience.
>
> I am going to do some investigation to improve machine translation
> when it is applied to texts corrupted by misspellings of any sort (non-word, real-word errors).
>
> In this preliminary phase I am collecting information about the spelling correction task
> and other applications and tasks that involve spelling correction.
>
> In particular, I am interested in
> - surveys about the task
> - statistics about the most common misspellings in texts of different languages and different genres
> - publicly available software for spelling correction
> - available corpora of noisy texts
> - any further resources that might be useful for my topic
>
I'm afraid the available data are quite scarce for many languages. Look
at Roger Mitton's website (http://www.dcs.bbk.ac.uk/~roger/): he has
several corpora for English and a PDF of his book on spell-checking.

For spell-checking, you can use ispell (a bit outdated), aspell
(modern), or hunspell (good for languages with complex compounding).
The autocorrect lists for OpenOffice.org are also worth a look, and
Wikipedias usually have a "frequent typos" page as well.

You might want to create your own corpora of spelling mistakes.
This is actually quite easy if you have several GB of free disk space
and a couple of days to process the revision history of Wikipedia. See
my paper: http://marcinmilkowski.pl/downloads/error_corpora.pdf. I have
some hacky scripts, but they are just a prototype written in AWK.
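
To give a rough idea of the approach, here is a minimal sketch in
Python. It is not the AWK prototype, and it assumes you have already
extracted two consecutive revisions of the same passage as plain text.
The function names and the max_distance threshold are hypothetical,
chosen only for illustration. The sketch aligns the tokens of both
revisions and keeps one-for-one word replacements that are within a
small edit distance, which is a crude proxy for "someone fixed a typo".

```python
import difflib
import re

# Assumption: a "word" is any run of Unicode word characters.
TOKEN = re.compile(r"\w+", re.UNICODE)


def edit_distance(a, b):
    """Plain Levenshtein distance (no transpositions) between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def candidate_corrections(old_text, new_text, max_distance=2):
    """Return (before, after) word pairs that look like spelling fixes.

    Hypothetical helper: compares two revisions of the same passage and
    keeps one-for-one word replacements within max_distance edits.
    """
    old = TOKEN.findall(old_text)
    new = TOKEN.findall(new_text)
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    pairs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # Only consider replaced spans with the same number of words.
        if tag != "replace" or (i2 - i1) != (j2 - j1):
            continue
        for before, after in zip(old[i1:i2], new[j1:j2]):
            if before.lower() == after.lower():
                continue  # ignore pure capitalisation changes
            if edit_distance(before, after) <= max_distance:
                pairs.append((before, after))
    return pairs


if __name__ == "__main__":
    old_rev = "I recieve teh mail every day"
    new_rev = "I receive the mail every day"
    for before, after in candidate_corrections(old_rev, new_rev):
        print(before, "->", after)
```

A filter like this will still let through vandalism, rewording, and
real-word changes, so the resulting pairs need further cleaning (for
example, keeping only pairs seen several times across articles).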
Regards,
Marcin
_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora