[Corpora-List] Frequency list of transformations
    Dragomir Radev 
    radev at umich.edu
       
    Fri Jan 21 14:29:29 UTC 2005
    
    
  
This is a bit tricky. There is no straightforward way to tell, by
looking at a single pair like "occurence" and "occurrence", that the
double "r" in the latter corresponds to the single "r" in the former.
You should probably have a prior model of the types of
substitutions that are likely (e.g., doubling letters in this case).
A quick solution may involve using the standard diff algorithm.
Here is what I was able to put together in 10 minutes for you. This
code is in Perl and it uses a module (Algorithm::Diff) that you can
download from CPAN.
---------------------- mydiff.pl ------------------------
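#!/usr/local/bin/perl
use strict;
use warnings;
use Algorithm::Diff qw(diff);

# Split the two command-line arguments into lists of characters.
my @i1 = split //, shift;
my @i2 = split //, shift;

# diff() returns a list of hunks; each hunk is a list of
# [sign, position, character] triples.
my $diffs = diff(\@i1, \@i2);
foreach my $c (@$diffs) {
    foreach my $l (@$c) {
        my ($sign, $n, $diff) = @$l;
        # print, not printf: a "%" in the input must not be read as a format.
        print "$sign$diff ";
    }
    print "\n";
}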
---------------------------------------------------------
./mydiff.pl "heavie" "heavy"
-i +y -e
---------------------------------------------------------
Then you can run it over a file of word pairs (one pair per line, no
header lines) with a shell one-liner:
----------------------------------------------------------
perl -ne 'print "./mydiff.pl $_"' file | sh > output
----------------------------------------------------------
Here is the output:
+r
-o +a
-m
-v +f
-i +y -e
+r
-v +f
You can further pipe it to
----------------------------------------------------
sort output | uniq -c | sort -nr | more
----------------------------------------------------
This will give you all substitutions in decreasing frequency:
----------------------------------------------------
      2 -v +f
      2 +r
      1 -o +a
      1 -m
      1 -i +y -e
----------------------------------------------------
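The diff output above reports single-character edits. For output closer to the "x -> y" substring form in the original question, one rough alternative (not the diff approach above, and needing no CPAN module) is to strip the longest common prefix and suffix from each pair and count what is left. The sketch below is only an illustration under that assumption: the names middle_change and count_changes are hypothetical, the direction is wrong -> correct, and a pair with two separate errors is reported as one wide change.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the differing middle parts of two words after stripping the
# longest common prefix and then the longest common suffix.
# (Hypothetical helper, not part of the original mydiff.pl.)
sub middle_change {
    my ($wrong, $correct) = @_;
    my @w = split //, $wrong;
    my @c = split //, $correct;

    # Drop the shared prefix.
    while (@w && @c && $w[0] eq $c[0]) {
        shift @w;
        shift @c;
    }
    # Drop the shared suffix of what is left.
    while (@w && @c && $w[-1] eq $c[-1]) {
        pop @w;
        pop @c;
    }
    return (join('', @w), join('', @c));
}

# Count "wrong -> correct" changes over a list of [wrong, correct] pairs.
sub count_changes {
    my @pairs = @_;
    my %count;
    for my $pair (@pairs) {
        my ($from, $to) = middle_change(@$pair);
        next if $from eq '' && $to eq '';    # identical words
        $from = '()' if $from eq '';         # pure insertion
        $to   = '()' if $to eq '';           # pure deletion
        $count{"$from -> $to"}++;
    }
    return %count;
}

# The seven pairs from the original question.
my @pairs = (
    [qw(occurence occurrence)], [qw(occosion occasion)],
    [qw(commputer computer)],   [qw(live life)],
    [qw(heavie heavy)],         [qw(geat great)],
    [qw(save safe)],
);

my %count = count_changes(@pairs);
# Most frequent first; ties broken alphabetically for stable output.
for my $change (sort { $count{$b} <=> $count{$a} || $a cmp $b } keys %count) {
    printf "%7d %s\n", $count{$change}, $change;
}
```

On the seven pairs this prints "() -> r" and "v -> f" with a count of 2, followed by "ie -> y", "m -> ()" and "o -> a" with a count of 1 each. Stripping the prefix first is an arbitrary choice: for "occurence" it cannot say which of the two "r"s was missing, which is the ambiguity noted at the top of this message.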
Drago
Marijke Koster wrote:
>
> Dear corpora list members,
>
> Does anyone have a suggestion for a simple method / a script to extract
> a frequency list of transformations from a list of spelling errors and
> corrections?
>
> For example here's this tab separated list:
>
> wrong      correct
> -----      -------
> occurence  occurrence
> occosion   occasion
> commputer  computer
> live       life
> heavie     heavy
> geat       great
> save       safe
>
> After applying the method it should result in something like this
> 1 rr -> r
> 1 a  -> o
> 1 m  -> mm
> 2 f  -> v
> 1 y  -> ie
> 1 r  -> ()
>
> Thanks in advance,
> Marijke Koster
> ______________________________________
> Marijke Koster, linguistic engineer
> Polderland Language & Speech Technology BV
> The Netherlands
> http://www.polderland.nl
> Phone: +31.24.352 28 66
> Fax:   +31.24.352 28 60
>
>
>
>
>
--
Dragomir R. Radev                                         radev at umich.edu
Assistant Professor of Information, Electrical Engineering and
Computer Science, and Linguistics, the University of Michigan, Ann Arbor
Phone: 734-615-5225   Fax: 734-764-2475    http://www.si.umich.edu/~radev
    
    