[Corpora-List] Summary: frequency list of transformations

Marijke Koster marijke at polderland.nl
Tue Jan 25 08:33:27 UTC 2005


Dear corpora list members,

Thank you all for your valuable contributions to my question. 

The suggestion of using the Levenshtein algorithm for this purpose has
been very valuable. The Levenshtein distance (LD) is a measure of the
difference between two strings, denoted here by s1 and s2. The distance
is the minimum number of deletions, insertions or substitutions of
single characters required to transform s1 into s2. The greater the
distance, the more different the strings are.
More information can be found at http://www.merriampark.com/ld.htm.
The Brew edit distance has also been suggested.
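
For readers who want to try this out, here is a minimal sketch of the
Levenshtein distance in Python, using the usual dynamic-programming
formulation with two rows. It is not one of the scripts sent to me; the
function name and the example words are my own illustration.

```python
def levenshtein(s1: str, s2: str) -> int:
    """Return the minimum number of single-character insertions,
    deletions and substitutions needed to turn s1 into s2."""
    # Keep the shorter string in the inner loop so the rows stay small.
    if len(s1) < len(s2):
        s1, s2 = s2, s1

    # previous[j] = distance between the empty prefix of s1 and s2[:j]
    previous = list(range(len(s2) + 1))

    for i, c1 in enumerate(s1, start=1):
        # Turning s1[:i] into the empty string takes i deletions.
        current = [i]
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            current.append(min(
                previous[j] + 1,         # delete c1
                current[j - 1] + 1,      # insert c2
                previous[j - 1] + cost,  # substitute c1 by c2 (or match)
            ))
        previous = current

    return previous[-1]


if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))
```

The distance only says how many edits are needed, not which ones, so for
a list of transformations an alignment of the two strings is still
required.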

Some of you have sent me a ready-made script (using for example a
string-edit alignment and a standard diff algorithm) for extracting a
list of transformations, for which many thanks.
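
To give an idea of the diff-based approach, here is a rough sketch in
Python using difflib from the standard library. It is not one of the
scripts I received. The function names and the word pairs are invented
for illustration, and difflib's alignment is not guaranteed to be a
minimal edit script, so the transformations it reports may differ from
those of a true Levenshtein alignment.

```python
from collections import Counter
from difflib import SequenceMatcher


def transformations(error: str, correction: str) -> list[tuple[str, str]]:
    """Return (erroneous substring, corrected substring) pairs for every
    span where the two strings differ. An empty string on the left means
    an insertion, an empty string on the right means a deletion."""
    # autojunk=False switches off difflib's heuristic for long sequences.
    matcher = SequenceMatcher(None, error, correction, autojunk=False)
    return [
        (error[i1:i2], correction[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


def transformation_frequencies(pairs) -> Counter:
    """Count how often each transformation occurs in (error, correction) pairs."""
    counts = Counter()
    for error, correction in pairs:
        counts.update(transformations(error, correction))
    return counts


if __name__ == "__main__":
    # Invented example pairs, not taken from the children's corpus.
    pairs = [("kat", "kit"), ("bat", "bit"), ("hond", "hont")]
    for (wrong, right), n in transformation_frequencies(pairs).most_common():
        print(f"{wrong!r} -> {right!r}: {n}")
```

Sorting the counter with most_common() gives the frequency list of
transformations directly.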

Some of you were interested in the list of spelling errors and
corrections. Please let me elaborate.
In cooperation with the Fryske Akademy ("The Frisian Academy"),
Polderland developed the "Fryske TaalHelp" last year. The product is
a unique combination of a Frisian spellchecker and the electronic
version of a Frisian - Dutch dictionary, fully integrated in
Microsoft(r) Office.
We are now working on a children's version of the Fryske TaalHelp.
Suggestions offered by the spellchecker will be adapted to the
children's proficiency level. We have a set of texts written by Frisian
children (approximately 20,000 words) in which spelling errors are
tagged as such and in which the correction has been added. This list
gives us the opportunity to do some research on the sort of errors
children tend to make. The conclusions will be integrated in the
spellchecker's suggestion engine.
Unfortunately, I cannot share the list with you.

Thanks for all your help,
Marijke Koster
______________________________________
Marijke Koster, linguistic engineer
Polderland Language & Speech Technology BV
The Netherlands
http://www.polderland.nl
Phone: +31.24.352 28 66
Fax:   +31.24.352 28 60


