Corpora: Cameron Smart's q about chi-squared test for bigrams

Geoffrey Sampson geoffs at cogs.susx.ac.uk
Thu Dec 20 11:45:29 UTC 2001


Cameron Smart's question is very interesting.  My immediate reaction is
that if one is interested in the frequency of a bigram (a sequence of two
words), the comparison should be with other bigrams, i.e. all other
pairings of immediately successive words in the corpora.  The trouble,
though, is that the observations are not independent: successive bigrams
overlap, so an occurrence of bigram X Y makes it more likely that there
will be an occurrence of Y Z.  Is this the kind of failure of independence
which can in practice be ignored?  My feeling for statistics is not strong
enough to give an answer.
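
As a rough illustration of the comparison described above, here is a
minimal Python sketch that builds the 2x2 table for one bigram X Y against
all other bigrams in a token list and computes the Pearson chi-squared
statistic from it.  The function name bigram_chi_squared and the toy token
list are hypothetical, chosen only for this example.  Note that the sketch
simply treats every bigram as an independent observation, which is exactly
the assumption questioned above; it does nothing to correct for the overlap
between successive bigrams.

```python
from collections import Counter


def bigram_chi_squared(tokens, x, y):
    """Pearson chi-squared for bigram (x, y) versus all other bigrams.

    Assumption: each pair of adjacent tokens counts as one independent
    observation, which overlapping bigrams do not strictly satisfy.
    """
    bigrams = list(zip(tokens, tokens[1:]))
    n = len(bigrams)
    pair_counts = Counter(bigrams)
    first_counts = Counter(a for a, _ in bigrams)
    second_counts = Counter(b for _, b in bigrams)

    o11 = pair_counts[(x, y)]        # x followed by y
    o12 = first_counts[x] - o11      # x followed by something other than y
    o21 = second_counts[y] - o11     # something other than x followed by y
    o22 = n - o11 - o12 - o21        # neither x first nor y second

    numerator = n * (o11 * o22 - o12 * o21) ** 2
    denominator = (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)
    # A zero margin leaves the statistic undefined; return 0.0 in that case.
    return numerator / denominator if denominator else 0.0


if __name__ == "__main__":
    toy_tokens = "a b a b c d".split()
    print(bigram_chi_squared(toy_tokens, "a", "b"))
```

With one degree of freedom, the result would ordinarily be compared with
the chi-squared critical value (about 3.84 at the 5% level), subject to the
independence caveat.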

G.R. Sampson, Professor of Natural Language Computing

School of Cognitive & Computing Sciences
University of Sussex
Falmer, Brighton BN1 9QH, GB

e-mail geoffs at cogs.susx.ac.uk
tel. +44 1273 678525
fax  +44 1273 671320
web http://www.grsampson.net


