Corpora: Negative mutual information?

Jem Clear jem at jemclear.co.uk
Fri Mar 9 09:23:00 UTC 2001


> I have puzzled a bit over this notion of being 'less than
> what would be expected under independence'. Does this just
> mean that the words in the bigram are independent, or is
> something further suggested? I'd be interested if anyone else
> has some thoughts on that particular issue...

I'm no statistician (failed my advanced maths as a schoolboy!), but I
have also thought about negative MI scores (i.e. 'less than what would
be expected under independence') and I find it quite intuitively
acceptable that if you look at the co-occurrence of any two words in a
corpus you might observe one of the following three phenomena:

a) the two words occur more frequently together than under
independence (MI > 0): so the words have a tendency to "go together"
-- this *is* "collocation" as I see it.

b) the words occur together exactly the same number of times as we
would expect under the independence assumption (MI == 0): so what?
These words simply occur together because of their relative
frequencies in the corpus. No other factors influencing their
co-occurrence.

OR

c) the words occur less frequently together than under
independence (MI < 0): in which case these words have a tendency *not*
to occur together -- they "avoid" each other. Or rather
speakers/writers avoid putting them together. This is like
"anti-collocation" and it's a phenomenon very little studied as far as
I'm aware.
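
To make the three cases concrete, here is a minimal sketch in Python,
assuming the usual pointwise definition MI = log2(observed / expected),
where "expected" is the joint count predicted under independence. The
function names, the zero tolerance and the counts used below are
illustrative assumptions, not anything taken from a particular corpus
tool.

```python
import math


def pointwise_mi(pair_count, count_x, count_y, total):
    """Base-2 pointwise MI of a word pair, computed from raw counts.

    pair_count: times x and y co-occur; count_x, count_y: individual
    frequencies; total: number of positions (or windows) in the corpus.
    """
    if pair_count == 0:
        # The pair never co-occurs: the log is undefined, so report -inf.
        return float("-inf")
    # observed / expected, where expected = count_x * count_y / total
    ratio = (pair_count * total) / (count_x * count_y)
    return math.log2(ratio)


def describe(score, tol=1e-9):
    """Map an MI score onto the three cases a), b) and c) above."""
    if score > tol:
        return "collocation"        # a) together more often than expected
    if score < -tol:
        return "anti-collocation"   # c) together less often than expected
    return "independent"            # b) exactly what the frequencies predict


# Made-up counts: x occurs 100 times, y 50 times, in 1000 positions,
# so independence predicts 100 * 50 / 1000 = 5 co-occurrences.
for pair_count in (20, 5, 1):
    score = pointwise_mi(pair_count, 100, 50, 1000)
    print(pair_count, round(score, 2), describe(score))
```

With real data the score is almost never exactly zero, so in practice
case b) would be a band around zero rather than a single point.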

I used this negative<-->positive MI scale as a key to doing some
automatic word sense discrimination experiments some years ago. The
point was that if you calculated a set of **all** the words that
co-occurred with some chosen keyword and stored the MI value (be it
positive or negative), then one could use those MI values to weight
particular contexts in which the keyword appears (concordance lines!)
as "close to" (MI > 0) or "far away" (MI < 0) from a given sense of
the keyword. That's not very clear, is it? It's not important
anyway. I just mean that negative MI is a) certainly possible, b)
consonant with our intuitive notions about how words go together in
text and c) quite useful!
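
One possible reading of the weighting idea above can be sketched as
follows. This is only a guess at the general shape, not a
reconstruction of the original experiments: the scoring rule (a plain
sum of MI values), the function name, the example keyword and every MI
value in the table are invented for illustration.

```python
def score_context(tokens, mi_scores, keyword):
    """Sum the stored MI values of the words around the keyword.

    Words with no stored MI value contribute nothing. A positive total
    suggests the line is "close to" the sense the MI table reflects,
    a negative total suggests it is "far away".
    """
    return sum(mi_scores.get(tok, 0.0) for tok in tokens if tok != keyword)


# Hypothetical MI values for words co-occurring with "bank" in its
# riverside sense; the numbers are invented, not measured.
mi_scores = {"river": 3.1, "water": 2.4, "loan": -1.8, "interest": -2.2}

concordance = [
    "the bank of the river was muddy".split(),
    "the bank raised its interest rate on the loan".split(),
]

for line in concordance:
    score = score_context(line, mi_scores, "bank")
    label = "close to" if score > 0 else "far away from"
    print(" ".join(line), "->", label, "the chosen sense")
```

Summing is the simplest choice; averaging over the line length or
weighting by distance from the keyword would be equally plausible
variants.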

Jem Clear

29 School Road, Moseley, Birmingham, B13 9TF, UK
Tel & Fax: +44 (0)121 689 3637
Email:     jem at jemclear.co.uk
Web:       www.englishexpert.co.uk


