Corpora: ngram frequencies with intervening words?
Bruce Lambert
lambertb at uic.edu
Tue Apr 24 15:41:19 UTC 2001
Thanks to Lee Gilliam. Thanks also to Philip Resnik and Ted Pedersen, both
of whom pointed to Ted's bigram software:
http://www.d.umn.edu/~tpederse/code.html
Jens Enlund was kind enough to write his own Perl script to do the job. Not
exactly what I need, but darn close.
-bruce
--------------------
#!/usr/bin/perl -w
use strict;
# Get the words and the max allowed intervening words
#
my $w1 = shift @ARGV || die "Missing argument: WORD1\n";
my $w2 = shift @ARGV || die "Missing argument: WORD2\n";
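my $n = shift @ARGV;
# Accept N = 0 (adjacent words only); a plain "|| die" would reject 0 as false
defined $n && $n =~ /^\d+$/ or die "Missing or non-numeric argument: N\n";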
# globals
#
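my %res;
my $tot = 0;  # start at 0 so the summary prints without a warning when nothing matches
# read input line by line (any remaining file arguments, otherwise STDIN)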
#
while (<>) {
    # pattern match
    while (s/\b($w1) +((\w+ ){0,$n}?)($w2)\b//) {
        # Prettify a little
        my $tmp = $2;
        chop $tmp;
        # Count up intervening words (if any) and total
        $tmp ne "" && $res{$tmp}++;
        $tot++;
    }
}
# Print results (sloppy, will leave an extra blank before the newline)
#
print "$w1 $w2 ($tot) ";
foreach my $words (sort {$res{$a} <=> $res{$b}} keys %res) {
    print "($res{$words} \"$words\") ";
}
print "\n";
exit;
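
A small sketch of how the pattern behaves may help: it wraps the same regex in a sub so it can be tried on a few in-memory lines rather than on STDIN. The sub name count_gaps and the sample sentences are made up for illustration and are not part of Jens's script; this is only one way to poke at the matching, not a replacement for it.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not in the original script): applies the same
# regex to a list of lines and returns the total number of matches
# plus a hash ref of intervening-word strings and their counts.
sub count_gaps {
    my ($w1, $w2, $n, @lines) = @_;
    my %res;
    my $tot = 0;
    for my $line (@lines) {
        my $text = $line;    # work on a copy; each match is deleted below
        # Lazy {0,$n}? prefers the fewest intervening words
        while ($text =~ s/\b($w1) +((\w+ ){0,$n}?)($w2)\b//) {
            my $gap = $2;
            $gap =~ s/ $//;                # drop the trailing space
            $res{$gap}++ if $gap ne "";    # adjacent pairs count only toward the total
            $tot++;
        }
    }
    return ($tot, \%res);
}

# Made-up sample sentences, not taken from any corpus
my @sample = (
    "take the drug daily",
    "take this new drug twice",
    "take drug now",
);
my ($tot, $res) = count_gaps("take", "drug", 2, @sample);
print "total=$tot\n";
print "$_ => $res->{$_}\n" for sort keys %$res;
```

With these three sample lines the total should be 3, with "the" and "this new" each counted once; the third line has no intervening words, so it adds to the total only.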