[Corpora-List] Frequency list of transformations
Dragomir Radev
radev at umich.edu
Fri Jan 21 14:29:29 UTC 2005
This is a bit tricky. There is no straightforward way to tell, by
looking at a single pair like "occurence" and "occurrence", that the
double "r" in the latter corresponds to the single "r" in the former.
You should probably have a prior model of the types of
substitutions that are likely (e.g., doubling letters in this case).
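To make the idea of a prior concrete, here is a minimal sketch, separate from the diff approach described next. It assumes each pair differs in one contiguous stretch, strips the shared prefix and suffix, and reads a single extra or missing letter that repeats its left neighbour as a doubling error. The helper name describe_edit and the "correct -> wrong" output notation (borrowed from the original question) are choices made for illustration only.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Describe the difference between a misspelling and its correction as
# "correct -> wrong", preferring a doubled-letter reading when one fits.
# Assumption: the two words differ in a single contiguous stretch.
sub describe_edit {
    my ($wrong, $correct) = @_;

    # Length of the longest common prefix.
    my $p = 0;
    $p++ while $p < length($wrong) && $p < length($correct)
        && substr($wrong, $p, 1) eq substr($correct, $p, 1);

    # Length of the longest common suffix that does not overlap the prefix.
    my $s = 0;
    $s++ while $s < length($wrong) - $p && $s < length($correct) - $p
        && substr($wrong, -1 - $s, 1) eq substr($correct, -1 - $s, 1);

    # What is left in the middle of each word.
    my $w = substr($wrong,   $p, length($wrong)   - $p - $s);
    my $c = substr($correct, $p, length($correct) - $p - $s);

    # Prior: a single inserted or deleted letter that repeats its left
    # neighbour is read as a doubling error, not as a bare insertion.
    my $prev = $p > 0 ? substr($wrong, $p - 1, 1) : '';
    if (length($w) + length($c) == 1 && ($w . $c) eq $prev) {
        return $c eq '' ? "$prev -> $prev$prev" : "$prev$prev -> $prev";
    }
    return ($c eq '' ? '()' : $c) . ' -> ' . ($w eq '' ? '()' : $w);
}

print describe_edit(@$_), "\n"
    for ['occurence', 'occurrence'], ['commputer', 'computer'], ['geat', 'great'];
```

For these three pairs it prints "rr -> r", "m -> mm" and "r -> ()". Pairs with two separate errors would be lumped into one long substitution, which is where a real diff does better.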
A quick solution may involve using the standard diff algorithm.
Here is what I was able to put together in 10 minutes for you. This
code is in Perl and it uses a module (Algorithm::Diff) that you can
download from CPAN.
---------------------- mydiff.pl ------------------------
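#!/usr/local/bin/perl
use strict;
use warnings;
use Algorithm::Diff qw(diff);

# Split both command-line arguments (wrong, correct) into characters.
my @i1 = split '', shift;
my @i2 = split '', shift;

# diff() returns a list of hunks; each hunk is a list of
# [sign, position, character] triples ("-" = only in the first word).
my $diffs = diff(\@i1, \@i2);
foreach my $c (@$diffs) {
    foreach my $l (@$c) {
        my ($sign, $n, $diff) = @$l;
        print "$sign$diff ";    # print, not printf: a "%" must stay literal
    }
    print "\n";
}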
---------------------------------------------------------
./mydiff.pl "heavie" "heavy"
-i +y -e
---------------------------------------------------------
Then you can use a shell script:
----------------------------------------------------------
cat file | perl -ne 'print "./mydiff.pl $_"' | sh > output
----------------------------------------------------------
Here is the output:
+r
-o +a
-m
-v +f
-i +y -e
+r
-v +f
You can further pipe it to
----------------------------------------------------
sort output | uniq -c | sort -nr | more
----------------------------------------------------
This will give you all substitutions in decreasing frequency:
----------------------------------------------------
2 -v +f
2 +r
1 -o +a
1 -m
1 -i +y -e
----------------------------------------------------
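If you want the notation from the original question ("correct -> wrong") instead of signed characters, the hunk lines can be rewritten before counting. This is only a sketch: it assumes each line holds one hunk in the format mydiff.pl prints, with "-" characters from the misspelling and "+" characters from the correction. The names hunk_to_rule and tally are made up for illustration, and the sample data is the output shown above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Turn one hunk line from mydiff.pl (e.g. "-i +y -e") into "correct -> wrong".
sub hunk_to_rule {
    my ($line) = @_;
    my ($wrong, $correct) = ('', '');
    for my $tok (split ' ', $line) {
        # Skip anything that is not a sign followed by one character.
        my ($sign, $char) = $tok =~ /^([-+])(.)$/ or next;
        if ($sign eq '-') { $wrong .= $char } else { $correct .= $char }
    }
    return ($correct eq '' ? '()' : $correct) . ' -> '
         . ($wrong   eq '' ? '()' : $wrong);
}

# Count how often each rewritten rule occurs; blank lines are ignored.
sub tally {
    my %count;
    $count{ hunk_to_rule($_) }++ for grep { /\S/ } @_;
    return \%count;
}

# Sample data: the hunk lines listed above.
my @hunks = ('+r', '-o +a', '-m', '-v +f', '-i +y -e', '+r', '-v +f');
my $count = tally(@hunks);

# Most frequent first, ties broken alphabetically.
for my $rule (sort { $count->{$b} <=> $count->{$a} || $a cmp $b } keys %$count) {
    printf "%d %s\n", $count->{$rule}, $rule;
}
```

Passing the lines of the output file to tally (for example tally(<STDIN>)) instead of the hard-coded list should work the same way. Note that a bare diff has no notion of doubling, so "commputer" comes out as "() -> m" rather than "m -> mm".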
Drago
Marijke Koster wrote:
>
> Dear corpora list members,
>
> Does anyone have a suggestion for a simple method / a script to extract
> a frequency list of transformations from a list of spelling errors and
> corrections?
>
> For example here's this tab separated list:
>
> wrong correct
> ----- -------
> occurence occurrence
> occosion occasion
> commputer computer
> live life
> heavie heavy
> geat great
> save safe
>
> After applying the method it should result in something like this
> 1 rr -> r
> 1 a -> o
> 1 m -> mm
> 2 f -> v
> 1 y -> ie
> 1 r -> ()
>
> Thanks in advance,
> Marijke Koster
> ______________________________________
> Marijke Koster, linguistic engineer
> Polderland Language & Speech Technology BV
> The Netherlands
> http://www.polderland.nl
> Phone: +31.24.352 28 66
> Fax: +31.24.352 28 60
>
--
Dragomir R. Radev radev at umich.edu
Assistant Professor of Information, Electrical Engineering and
Computer Science, and Linguistics, the University of Michigan, Ann Arbor
Phone: 734-615-5225 Fax: 734-764-2475 http://www.si.umich.edu/~radev