[Corpora-List] Machine Translation and Spelling Correction

Alistair Baron a.baron at comp.lancs.ac.uk
Thu Dec 3 16:58:39 UTC 2009


Dear Nicola,

We have been researching spelling variation, particularly in Early
Modern English corpora; see http://ucrel.lancs.ac.uk/VariantSpelling/
for a summary. This research so far has included evaluating the effect
of spelling variation on key word analysis (1), part-of-speech tagging
(2) and semantic analysis (3).

Two tools have also been produced: VARD 2
(http://www.comp.lancs.ac.uk/~barona/vard2/), a freely available (for
academic use) Java program for manually and automatically
standardising spelling, which learns how to deal with the variety of
spelling variation in a given corpus (4); and DICER
(http://corpora.lancs.ac.uk/dicer/), which extracts letter replacement
rules from standardised spellings and builds a detailed database of
these spelling rules and their frequencies. The database can be used
to analyse spelling trends in a corpus or to create a rule set for
VARD 2 to use in future standardisation.
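
To make the idea of letter replacement rules concrete, here is a minimal
Python sketch. It is not DICER's actual algorithm or code: the function
names (extract_rules, rule_frequencies) and the word pairs are made up for
illustration, and the character alignment is simply borrowed from Python's
standard difflib module. It takes (variant, standardised) spelling pairs,
records the character spans where the two spellings differ, and counts how
often each replacement occurs.

```python
from collections import Counter
from difflib import SequenceMatcher


def extract_rules(variant, standard):
    """Return (variant_chars, standard_chars) pairs where the spellings differ.

    Illustrative only: the alignment comes from difflib, which is an
    assumption of this sketch and not a description of DICER itself.
    """
    matcher = SequenceMatcher(None, variant, standard, autojunk=False)
    rules = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # An empty string on either side marks an insertion or deletion.
            rules.append((variant[i1:i2], standard[j1:j2]))
    return rules


def rule_frequencies(pairs):
    """Count every replacement rule across (variant, standard) pairs."""
    counts = Counter()
    for variant, standard in pairs:
        counts.update(extract_rules(variant, standard))
    return counts


# Hypothetical training pairs: (original spelling, standardised spelling).
pairs = [("loue", "love"), ("haue", "have"), ("vp", "up"), ("goe", "go")]
freqs = rule_frequencies(pairs)
for (old, new), count in freqs.most_common():
    print(f"{old!r} -> {new!r}: {count}")
```

A frequency table of this kind is the sort of output that could be
inspected for spelling trends or, in principle, turned into a rule list
for a standardisation tool.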

Best,
Alistair

(1). Baron, A., Rayson, P. and Archer, D. (2009). Word frequency and
key word statistics in historical corpus linguistics. In Ahrens, R.
and Antor, H. (eds) Anglistik: International Journal of English
Studies, 20 (1), pp. 41-67.

(2). Rayson, P., Archer, D., Baron, A., Culpeper, J. and Smith, N.
(2007). Tagging the Bard: Evaluating the accuracy of a modern POS
tagger on Early Modern English corpora. In Davies, M., Rayson, P.,
Hunston, S. and Danielsson, P. (eds.) Proceedings of the Corpus
Linguistics Conference: CL2007, University of Birmingham, UK, 27-30
July 2007.

(3). Archer, D., McEnery, T., Rayson, P., Hardie, A. (2003).
Developing an automated semantic analysis system for Early Modern
English. In Proceedings of the Corpus Linguistics 2003 conference.
UCREL technical paper number 16. UCREL, Lancaster University, pp. 22 -
31.

(4). Baron, A. and Rayson, P. (2009). Automatic standardization
of texts containing spelling variation, how much training data do you
need? In Proceedings of Corpus Linguistics 2009, University of Liverpool,
UK, 20-23 July 2009.

2009/12/3 Nicola Bertoldi <bertoldi at fbk.eu>:
>
> I am going to do some investigation to improve machine translation
> when it is applied to texts corrupted by misspellings of any sort (non-word, real-word errors).
>
> In this preliminary phase I am collecting information about the spelling correction task
> and other applications and tasks which involve spelling correction.
>
> In particular, I am interested in
> - surveys about the task
> - statistics about the most common misspellings in texts of different languages and different genres
> - publicly available software for spelling correction
> - available corpora of noisy texts
> - any further resources which are possibly useful for my topic
>
>
>
> Thanks!
>
> Nicola
>
> _______________________________________________
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
>



-- 
Alistair Baron
C28, Computing Department,
Infolab 21, South Drive,
Lancaster University,
Lancaster,
LA1 4WA
+44 1524 510348
