[Corpora-List] Machine Translation and Spelling Correction
Martin Reynaert
reynaert at uvt.nl
Thu Dec 3 15:38:15 UTC 2009
Dear Nicola,
I have been working on these issues for years. You will find statistics
for English (based on Reuters RCV1) and Dutch (based on a selection of
Dutch newspapers) in:
Corpus-Induced Corpus Cleanup
Author(s): Martin Reynaert
Reference: In Proceedings of the Fifth International Conference
on Language Resources and Evaluation (LREC-06), Genoa, Italy, 2006.
The paper is available at http://ilk.uvt.nl.
I wish to eventually make these typo lists public, but have to sort out
IPR matters first.
In more recent papers you can also find statistics on OCR errors in large
digitized text collections.
My new system TICCL (Text-Induced Corpus Clean-up) will be made
available as open source as soon as I have finished extending and describing
it, which I hope will be very soon ;0)
Yours,
Martin Reynaert
ILK
TiCC
University of Tilburg
Nicola Bertoldi wrote:
> I am sending this message again with a more appropriate heading.
> Sorry for the inconvenience.
>
>
>
> I am going to do some investigation to improve machine translation
> when it is applied to texts corrupted by misspellings of any sort (non-word, real-word errors).
>
> In this preliminary phase I am collecting information about the spelling correction task
> and other applications and tasks that involve spelling correction.
>
> In particular, I am interested in
> - surveys about the task
> - statistics about the most common misspellings in texts of different languages and different genres
> - publicly available software for spelling correction
> - available corpora of noisy texts
> - any further resources that might be useful for my topic
>
>
>
> Thanks!
>
> Nicola
>
> ------ End of Forwarded Message
>
> _______________________________________________
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
>