[Corpora-List] Frequency list of transformations

Dragomir Radev radev at umich.edu
Fri Jan 21 18:03:21 UTC 2005


This is a bit tricky. There is no straightforward way to tell, by
looking at a single pair like "occurence" and "occurrence", that the
"rr" in the latter corresponds to the single "r" in the former. You
should probably have a prior model of the types of substitutions
that are likely (e.g., doubling letters in this case).
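
As a rough illustration of what such a prior could look like for the
doubling case only, here is a minimal Python sketch. It is not part of
the Perl script below; the function names runs and doubling_changes
are made up for this example, and it reports changes in the direction
wrong -> correct. It assumes the two words are identical once repeated
letters are collapsed, and gives up (returns None) otherwise.

```python
from itertools import groupby


def runs(word):
    """Run-length encode a word: 'occurrence' -> [('o', 1), ('c', 2), ...]."""
    return [(ch, len(list(group))) for ch, group in groupby(word)]


def doubling_changes(wrong, correct):
    """Return 'wrong -> correct' run changes if the two words differ only
    in how often a letter is repeated; otherwise return None."""
    a, b = runs(wrong), runs(correct)
    # Hypothetical prior: only accept pairs whose letters agree, in order,
    # once every run of repeated letters is collapsed to a single letter.
    if [ch for ch, _ in a] != [ch for ch, _ in b]:
        return None
    return [f"{ch * n} -> {ch * m}" for (ch, n), (_, m) in zip(a, b) if n != m]


print(doubling_changes("occurence", "occurrence"))  # ['r -> rr']
print(doubling_changes("commputer", "computer"))    # ['mm -> m']
print(doubling_changes("live", "life"))             # None
```

Pairs that come back as None would still need a more general method,
such as the diff-based one.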

A quick solution may involve using the standard diff algorithm.

Here is what I was able to put together in 10 minutes for you. This
code is in Perl and it uses a module (Algorithm::Diff) that you can
download from CPAN.

---------------------- mydiff.pl ------------------------
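#!/usr/local/bin/perl
# Print the character-level edits that turn the first argument into
# the second, one hunk per line (e.g. "-i +y -e").

use strict;
use warnings;
use Algorithm::Diff qw(diff);

# Split each word into a list of single characters.
my @i1 = split //, shift;
my @i2 = split //, shift;

# diff() returns a list of hunks; each hunk is a list of
# [sign, position, element] triples, where sign is '+' or '-'.
my $diffs = diff(\@i1, \@i2);
foreach my $c (@$diffs) {
    foreach my $l (@$c) {
        my ($sign, $n, $diff) = @$l;
        # print, not printf: the data must not be used as a format string.
        print "$sign$diff ";
    }
    print "\n";
}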
---------------------------------------------------------
./mydiff.pl "heavie" "heavy"
-i +y -e
---------------------------------------------------------

Then you can run it on every line of your file (one tab-separated
pair per line, header lines removed) with a shell one-liner:

----------------------------------------------------------
perl -ne 'print "./mydiff.pl $_"' file | sh > output
----------------------------------------------------------

Here is the output:

+r
-o +a
-m
-v +f
-i +y -e
+r
-v +f

You can then tally the lines with
----------------------------------------------------
sort output | uniq -c | sort -nr | more
----------------------------------------------------

This will give you all substitutions in decreasing frequency:

----------------------------------------------------
      2 -v +f
      2 +r
      1 -o +a
      1 -m
      1 -i +y -e
----------------------------------------------------
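
If you would rather not depend on Perl and CPAN, roughly the same
tally can be sketched in Python with the standard difflib module.
This is a sketch under assumptions, not a port of mydiff.pl: difflib
uses its own matching algorithm rather than the classic diff, each
changed region is reported as a single "wrong -> correct" entry, and
"()" stands for an empty side. The names PAIRS and transformations
are made up for this example; the data are the seven pairs from the
quoted message.

```python
from collections import Counter
from difflib import SequenceMatcher

# The seven pairs from the quoted message, as tab-separated lines.
PAIRS = """\
occurence\toccurrence
occosion\toccasion
commputer\tcomputer
live\tlife
heavie\theavy
geat\tgreat
save\tsafe
"""


def transformations(wrong, correct):
    """Yield 'wrong_part -> correct_part' for each differing region."""
    matcher = SequenceMatcher(None, wrong, correct, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # An empty side (pure insertion or deletion) is shown as "()".
            yield f"{wrong[i1:i2] or '()'} -> {correct[j1:j2] or '()'}"


counts = Counter()
for line in PAIRS.splitlines():
    wrong, correct = line.split("\t")
    counts.update(transformations(wrong, correct))

# Most frequent transformations first, in a "uniq -c"-like layout.
for change, n in counts.most_common():
    print(f"{n:7d} {change}")
```

On this data it counts "v -> f" and "() -> r" twice each, and
"o -> a", "m -> ()" and "ie -> y" once each.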

Drago

Marijke Koster wrote:
>
> Dear corpora list members,
>
> Does anyone have a suggestion for a simple method / a script to extract
> a frequency list of transformations from a list of spelling errors and
> corrections?
>
> For example here's this tab separated list:
>
> wrong      correct
> -----      -------
> occurence  occurrence
> occosion   occasion
> commputer  computer
> live       life
> heavie     heavy
> geat       great
> save       safe
>
> After applying the method it should result in something like this
> 1 rr -> r
> 1 a  -> o
> 1 m  -> mm
> 2 f  -> v
> 1 y  -> ie
> 1 r  -> ()
>
> Thanks in advance,


