[Corpora-List] Machine Translation and Spelling Correction

Thu Dec 3 16:27:29 UTC 2009

Ah, yes, if you don't care about the corrected text, creating a noisy 
corpus is trivially easy, as Yuval pointed out. I was speaking of an 
error corpus that would contain corrections as well.
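
To make the distinction concrete, here is a minimal sketch of what an error corpus with corrections looks like as data: each noisy sentence is stored next to its clean source. It generates the noise synthetically, which is an assumption made for illustration and not something proposed in this thread; artificial edits of this kind will not match the error distribution of real writers. The names `corrupt_word` and `make_error_corpus` are hypothetical.

```python
import random

ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def corrupt_word(word, rng):
    """Apply one random character-level edit to a word (illustrative only)."""
    if len(word) < 2:
        return word  # leave very short tokens untouched
    i = rng.randrange(len(word) - 1)
    op = rng.choice(["swap", "delete", "insert", "substitute"])
    if op == "swap":
        # transpose the characters at positions i and i + 1
        return word[:i] + word[i + 1] + word[i] + word[i + 2:]
    if op == "delete":
        return word[:i] + word[i + 1:]
    if op == "insert":
        return word[:i] + rng.choice(ALPHABET) + word[i:]
    # substitute the character at position i
    return word[:i] + rng.choice(ALPHABET) + word[i + 1:]


def make_error_corpus(sentences, error_rate=0.1, seed=0):
    """Return (noisy, clean) pairs; the clean side is the correction."""
    rng = random.Random(seed)  # seeded so the corpus is reproducible
    pairs = []
    for clean in sentences:
        noisy = " ".join(
            corrupt_word(w, rng) if rng.random() < error_rate else w
            for w in clean.split()
        )
        pairs.append((noisy, clean))
    return pairs
```

The noisy side alone is the "trivially easy" corpus; keeping the aligned clean side is what makes it usable for training or evaluating a corrector. For naturally occurring noise (blogs, IM logs), the clean side has to be produced by annotators, which is the expensive part.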

Best,
Marcin

Yuval Marton wrote:
> Nicola,
>
> There is quite a bit of work on spelling correction, using edit distance and other similarity measures. 
> One tool that is geared towards machine translation that comes to my mind right now is this:
> (spelling correction is only one element of this tool)
>
> Nizar Habash (2009). REMOOV: A Tool for Online Handling of Out-of-Vocabulary Words in Machine Translation. In Proc. of the 2nd International Conference on Arabic Language Resources and Tools (MEDAR), Cairo, Egypt.
>
> Any paraphrasing tool (for MT or otherwise) is likely to correct some spelling errors as well, although perhaps not with that as the primary focus.
>
> As for available corpora with noisy text... any corpus will do :-)
> It depends on the language, genre, and type of errors you are interested in. Transcribed conversations (speech) presumably have more errors in them. Some might be automatically mis-corrected. Blog and IM log corpora might be a good place to start.
>
> HTH,
>
> -Yuval
>  
>
>
> On Thu, Dec 3, 2009 at 10:01 AM, Nicola Bertoldi <bertoldi at fbk.eu> wrote:
>
> I am going to do some investigation to improve machine translation
> when it is applied to texts corrupted by misspellings of any sort (non-word, real-word errors).
>
> In this preliminary phase I am collecting information about the spelling correction task
> and about other applications and tasks that involve spelling correction.
>
> In particular, I am interested in
> - surveys about the task
> - statistics about the most common misspellings in texts of different languages and different genres
> - publicly available software for spelling correction
> - available corpora of noisy texts
> - any further resources that may be useful for my topic
>
>
>
> Thanks!
>
> Nicola
> _______________________________________________
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
>
>
>   
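
The edit-distance approach to spelling correction mentioned in Yuval's reply can be sketched as follows. This is a minimal illustration using plain Levenshtein distance and a toy vocabulary; it is not the method of REMOOV or of any tool named in the thread, and the names `edit_distance` and `suggest` are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute or match
            ))
        prev = curr
    return prev[-1]


def suggest(word, vocabulary, max_distance=2):
    """Rank vocabulary entries within max_distance edits of word."""
    scored = [(edit_distance(word, v), v) for v in vocabulary]
    return [v for d, v in sorted(scored) if d <= max_distance]
```

A lookup like this only handles non-word errors, since a real-word error is already in the vocabulary at distance zero; catching those requires context, for example a language model over the surrounding words.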
