[Corpora-List] Machine Translation and Spelling Correction

Yuval Marton ymarton at ccls.columbia.edu
Thu Dec 3 15:54:54 UTC 2009


Nicola,

There is quite a bit of work on spelling correction using edit distance and other similarity measures.
One tool geared towards machine translation that comes to mind right now is the following
(spelling correction is only one element of this tool):

Nizar Habash (2009). REMOOV: A Tool for Online Handling of Out-of-Vocabulary Words in Machine Translation. In Proc. of the 2nd International Conference on Arabic Language Resources and Tools (MEDAR), Cairo, Egypt.
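
To make the edit-distance idea mentioned above concrete, here is a minimal sketch in Python. It is not taken from REMOOV or any other tool; the function names (edit_distance, suggest), the toy vocabulary, and the max_distance threshold are all hypothetical choices for illustration. It only handles non-word errors: a real-word error already matches a vocabulary entry, so it would need context to detect.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, and substitutions that turn a into b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca -> cb, or match
            ))
        prev = curr
    return prev[-1]


def suggest(word, vocabulary, max_distance=2):
    """Return vocabulary entries within max_distance edits, closest first.

    max_distance=2 is an arbitrary illustrative threshold.
    """
    scored = sorted((edit_distance(word, v), v) for v in vocabulary)
    return [v for d, v in scored if d <= max_distance]


if __name__ == "__main__":
    # Hypothetical toy vocabulary, for illustration only.
    vocab = ["translation", "transition", "corpus"]
    print(suggest("translaton", vocab))
```

A real system would presumably combine such a distance with word frequencies or a language model to rank the candidates, rather than rely on the distance alone.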

Any paraphrasing tool (for MT or otherwise) is likely to correct some spelling errors as well, although perhaps not as its primary focus.

As for available corpora with noisy text... any corpus will do :-)
It depends on the language, genre, and type of errors you are interested in. Transcribed conversations (speech) presumably contain more errors, and some text might have been automatically mis-corrected. Corpora of blogs and IM logs might be a good place to start.

HTH,

-Yuval
 


On Thu, Dec 3, 2009 at 10:01 AM, Nicola Bertoldi <bertoldi at fbk.eu> wrote:

I am going to investigate how to improve machine translation
when it is applied to texts corrupted by misspellings of any sort (non-word and real-word errors).

In this preliminary phase I am collecting information about the spelling correction task
and about other applications and tasks that involve spelling correction.

In particular, I am interested in
- surveys about the task
- statistics about the most common misspellings in texts of different languages and different genres
- publicly available software for spelling correction
- available corpora of noisy texts
- any further resources that might be useful for my topic



Thanks!

Nicola
_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora


