Corpora: ngram frequencies with intervening words?

Bruce Lambert lambertb at
Tue Apr 24 15:41:19 UTC 2001

Thanks to Lee Gilliam. Thanks also to Philip Resnik and Ted Pedersen, both
of whom pointed to Ted's bigram software:

Jens Enlund was kind enough to write his own Perl script to do the job. Not
exactly what I need, but darn close.



#!/usr/bin/perl -w

use strict;

# Get the words and the max allowed intervening words
my $w1 = shift @ARGV || die "Missing argument: WORD1\n";
my $w2 = shift @ARGV || die "Missing argument: WORD2\n";
my $n  = shift @ARGV;
# N may legitimately be 0 (adjacent words), so test definedness, not truth
die "Missing argument: N\n" unless defined $n;

# globals
my %res;
my $tot = 0;   # start at 0 so the summary prints cleanly under -w when nothing matches

# read input line by line (STDIN, or any files named after N)
while (<>) {
  # pattern match: WORD1, up to N intervening words, then WORD2;
  # each match is deleted so the inner loop moves on to the next one
  while (s/\b($w1) +((\w+ ){0,$n}?)($w2)\b//) {
     # Prettify a little (drop the trailing blank)
     my $tmp = $2;
     chop $tmp;
     # Count up intervening words (if any) and total
     $tmp ne "" && $res{$tmp}++;
     $tot++;
  }
}

# Print results (sloppy, will leave an extra blank before the newline)
print "$w1 $w2 ($tot) ";

foreach my $words (sort {$res{$a} <=> $res{$b}} keys %res) {
  print "($res{$words} \"$words\") ";
}

print "\n";
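
For a quick check of what the pattern does, here is a minimal sketch that wraps the same regular expression in a subroutine and runs it on two made-up sentences. The subroutine name count_gaps and the sample sentences are assumptions for illustration only; they are not part of Jens's script.

```perl
use strict;
use warnings;

# Hypothetical helper: count WORD1 ... WORD2 pairs separated by at most
# $n intervening words. Returns the total number of matches and a hash
# ref mapping each non-empty intervening string to its count.
sub count_gaps {
    my ($w1, $w2, $n, @lines) = @_;
    my %res;
    my $tot = 0;
    for my $line (@lines) {            # @lines is a copy, so s/// is safe
        # Same pattern as the script above; each match is deleted
        while ($line =~ s/\b($w1) +((\w+ ){0,$n}?)($w2)\b//) {
            my $gap = $2;
            chop $gap;                 # drop the trailing blank, if any
            $res{$gap}++ if $gap ne "";
            $tot++;
        }
    }
    return ($tot, \%res);
}

# Made-up sample input, at most 2 intervening words
my ($tot, $res) = count_gaps('strong', 'tea', 2,
    'she likes strong tea and strong black tea',
    'strong very hot tea or strong iced green sweet tea');

print "strong tea ($tot)";
print " ($res->{$_} \"$_\")"
    for sort { $res->{$a} <=> $res->{$b} || $a cmp $b } keys %$res;
print "\n";
# prints: strong tea (3) (1 "black") (1 "very hot")
```

The adjacent pair "strong tea" counts toward the total but adds no entry to the table, and "strong iced green sweet tea" is skipped because three words intervene.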

