[Corpora-List] Machine Translation and Spelling Correction
Yuval Marton
ymarton at ccls.columbia.edu
Thu Dec 3 15:54:54 UTC 2009
Nicola,
There is quite a bit of work on spelling correction, using edit distance and other similarity measures.
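To make the edit-distance idea concrete, here is a minimal sketch in Python of Levenshtein distance plus a naive candidate lookup against a word list. The function names `edit_distance` and `suggest`, the `max_distance` threshold, and the toy vocabulary are illustrative assumptions, not taken from any particular tool; real spelling correctors typically add frequency or language-model scores on top of this.

```python
from typing import Iterable, List


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions and substitutions turning a into b."""
    # Keep the shorter string in the inner loop so each row stays small.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def suggest(word: str, vocabulary: Iterable[str], max_distance: int = 2) -> List[str]:
    """Return vocabulary entries within max_distance of word, closest first.
    Hypothetical helper: a linear scan, fine only for small word lists."""
    scored = sorted((edit_distance(word, v), v) for v in vocabulary)
    return [v for d, v in scored if d <= max_distance]


if __name__ == "__main__":
    print(suggest("translaton", ["corpus", "translation"]))
```

Note that a lookup like this only catches non-word errors; real-word errors need context, since the misspelling is itself in the vocabulary.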
One MT-oriented tool that comes to mind (spelling correction is only one of its components):
Nizar Habash (2009). REMOOV: A Tool for Online Handling of Out-of-Vocabulary Words in Machine Translation. In Proc. of the 2nd International Conference on Arabic Language Resources and Tools (MEDAR), Cairo, Egypt.
Any paraphrasing tool (for MT or otherwise) is likely to correct some spelling errors as well, although perhaps not with that as the primary focus.
As for available corpora with noisy text... any corpus will do :-)
It depends on the language, genre, and type of errors you are interested in. Transcribed conversations (speech) presumably contain more errors, and some of those may have been automatically mis-corrected. Corpora of blogs and IM logs might be a good place to start.
HTH,
-Yuval
On Thu, Dec 3, 2009 at 10:01 AM, Nicola Bertoldi <bertoldi at fbk.eu> wrote:
I am going to do some investigation to improve machine translation
when it is applied to texts corrupted by misspellings of any sort (non-word, real-word errors).
In this preliminary phase I am collecting information about the spelling correction task
and other applications and tasks that involve spelling correction.
In particular, I am interested in
- surveys about the task
- statistics about the most common misspellings in texts of different languages and different genres
- publicly available software for spelling correction
- available corpora of noisy texts
- any further resources that might be useful for my topic
Thanks!
Nicola
_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora