[Corpora-List] Machine Translation and Spelling Correction

Martin Reynaert reynaert at uvt.nl
Thu Dec 3 15:38:15 UTC 2009


Dear Nicola,

I have been working on these issues for years. You will find statistics 
for English (based on Reuters RCV1) and Dutch (based on a selection of 
Dutch newspapers) in:

Corpus-Induced Corpus Cleanup

       Author(s): Martin Reynaert
       Reference: In Proceedings of the Fifth International Conference
                  on Language Resources and Evaluation (LREC-06),
                  Genoa, Italy, 2006.

The paper is available at http://ilk.uvt.nl.

I wish to eventually make these typo lists public, but have to sort out 
IPR matters first.

In more recent papers you can also find statistics on OCR errors in large 
digitized text collections.

My new system TICCL (Text-Induced Corpus Clean-up) will be made 
available as open source as soon as I have finished extending and 
describing it, which I hope will be very soon ;0)

Yours,

Martin Reynaert
ILK
TiCC
University of Tilburg



Nicola Bertoldi wrote:
> I am sending this message again with a more appropriate heading.
> Sorry for the inconvenience.
> 
> 
> 
> I am going to do some investigation to improve machine translation
> when it is applied to texts corrupted by misspellings of any sort (non-word, real-word errors).
> 
> In this preliminary phase I am collecting information about the spelling correction task
> and other applications and tasks that involve spelling correction.
> 
> In particular, I am interested in
> - surveys about the task
> - statistics about the most common misspellings in texts of different languages and different genres
> - publicly available software for spelling correction
> - available corpora of noisy texts
> - any further resources that may be useful for my topic
> 
> 
> 
> Thanks!
> 
> Nicola
> 
> ------ End of Forwarded Message
> 
> _______________________________________________
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
> 




