Corpora: ngram frequencies with intervening words?

Bruce Lambert lambertb at uic.edu
Mon Apr 23 20:41:24 UTC 2001


Greetings,

In the simplest case, when we compute ngram word frequencies, we consider
adjacent words as ngrams. But we may also want to know about pairs of words
that occur within n words of one another. Is there a program out there to
compute ngram frequencies allowing a variable-width window between the
words in the bigram? Ideally, the program would allow the user to rank the
bigrams not only by bigram frequency, but also by the frequency of the
intervening word patterns. For example, in a database of eighth-grade
science lessons, the bigram "atom smallest" might occur several times in
different contexts. I'd like output approximately as follows:

atom smallest (3) (1 "was the") (2 "is the")

This would indicate that the bigram "atom smallest", with a window of 2
intervening words, occurred 3 times in total: once with the intervening
words "was the" and twice with the intervening words "is the".

I can think of a brute force way to do this myself, of course, but I'd
rather not reinvent the wheel if I can avoid it.
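
For concreteness, a minimal Python sketch of one such brute-force approach
might look like the following. It assumes the text has already been
tokenized into a flat list of lowercased words, it ignores sentence
boundaries, and it treats the window as the maximum number of intervening
words. The names gapped_bigrams, format_entry, and max_gap are made up for
illustration and do not refer to any existing tool.

```python
from collections import Counter, defaultdict


def gapped_bigrams(tokens, max_gap=2):
    """Count ordered word pairs separated by 0..max_gap intervening words.

    Returns a dict mapping (left, right) to a Counter keyed by the
    intervening words joined with spaces ("" for adjacent words).
    """
    counts = defaultdict(Counter)
    for i, left in enumerate(tokens):
        # j is the position of the right-hand word; j - i - 1 words lie between.
        for j in range(i + 1, min(i + max_gap + 2, len(tokens))):
            gap = " ".join(tokens[i + 1:j])
            counts[(left, tokens[j])][gap] += 1
    return counts


def format_entry(pair, gaps):
    """Render one pair as: left right (total) (n "gap") (n "gap") ..."""
    total = sum(gaps.values())
    # Least frequent gap pattern first, ties broken alphabetically.
    ordered = sorted(gaps.items(), key=lambda kv: (kv[1], kv[0]))
    parts = " ".join(f'({n} "{g}")' for g, n in ordered)
    return f"{pair[0]} {pair[1]} ({total}) {parts}"


if __name__ == "__main__":
    # Hypothetical toy text, not real corpus data.
    text = ("the atom is the smallest unit . an atom was the smallest thing . "
            "the atom is the smallest particle")
    counts = gapped_bigrams(text.split(), max_gap=2)
    pair = ("atom", "smallest")
    print(format_entry(pair, counts[pair]))
```

Ranking the pairs by total frequency is then a matter of sorting the keys
on sum(counts[pair].values()). The nested loop does roughly N * (max_gap + 1)
updates for N tokens, and the table of pairs can grow large on a big corpus,
which is part of why I'd prefer an existing tool.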

-bruce


