Corpora: ngram frequencies with intervening words?

Bruce Lambert lambertb at uic.edu
Tue Apr 24 15:41:19 UTC 2001


Thanks to Lee Gilliam. Thanks also to Philip Resnik and Ted Pedersen, both
of whom pointed to Ted's bigram software:

http://www.d.umn.edu/~tpederse/code.html

Jens Enlund was kind enough to write his own Perl script to do the job. Not
exactly what I need, but darn close.

-bruce

--------------------

#!/usr/bin/perl -w

use strict;

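# Get the words and the max allowed intervening words.
# Test with defined() so that N = 0 (no intervening words) is accepted.
#
defined(my $w1 = shift @ARGV) or die "Missing argument: WORD1\n";
defined(my $w2 = shift @ARGV) or die "Missing argument: WORD2\n";
defined(my $n  = shift @ARGV) or die "Missing argument: N\n";
$n =~ /^\d+$/ or die "N must be a non-negative integer\n";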

# globals
#
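my %res;        # intervening-word string => count
my $tot = 0;    # total matches; starts at 0 so "no matches" prints cleanly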

# read the input line by line (any files left in @ARGV, otherwise STDIN)
#
while (<>) {
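  # Find WORD1, up to N intervening words, then WORD2. The s/// deletes
  # each match from the line, so the loop ends when no match is left.
  # WORD1 and WORD2 are interpolated as regular expressions, not literals.
  while (s/\b($w1) +((\w+ ){0,$n}?)($w2)\b//) {
     # Drop the trailing blank from the intervening words (if any)
     my $tmp = $2;
     chop $tmp;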
     # Count up intervening words (if any) and total
     $tmp ne "" && $res{$tmp}++;
     $tot++;
  }
}

# Print results, least frequent intervening words first
#
print join(" ", "$w1 $w2 ($tot)",
           map { "($res{$_} \"$_\")" }
           sort { $res{$a} <=> $res{$b} } keys %res), "\n";

exit;
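
For illustration only, here is a minimal sketch of the matching step from
the script above, applied to one hard-coded line instead of a file. The
helper name count_gaps and the sample sentence are made up for this
example and are not part of Jens's script; the pattern itself is the same.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not in the original script): returns the total
# number of WORD1 ... WORD2 matches in one line, plus a hash reference
# mapping each non-empty run of intervening words to its count.
sub count_gaps {
    my ($line, $w1, $w2, $n) = @_;
    my %gaps;
    my $total = 0;
    # Same pattern as the script: WORD1, at most $n intervening words
    # (as few as possible), then WORD2. Each match is deleted in turn.
    while ($line =~ s/\b($w1) +((\w+ ){0,$n}?)($w2)\b//) {
        my $gap = $2;
        $gap =~ s/ $//;                 # drop the trailing blank
        $gaps{$gap}++ if $gap ne "";
        $total++;
    }
    return ($total, \%gaps);
}

# Made-up sample input
my ($total, $gaps) =
    count_gaps("the big red dog saw the dog and the old dog", "the", "dog", 2);

print "total=$total\n";
print "$gaps->{$_} \"$_\"\n" for sort keys %$gaps;
```

On this sample the sketch should report three matches in total, with
"big red" and "old" each seen once as intervening words; the adjacent
"the dog" counts toward the total but adds no intervening-word entry.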


