Corpora: Negative mutual information?

Sat Mar 10 15:29:57 UTC 2001

In my 1994 dissertation I talk a little bit about words occurring
together less often than would be expected under independence. I would
certainly not claim that such pairs are independent. A revised version
was published in 1997 by Routledge as "Negative contexts. Collocation,
polarity, and multiple negation". Check the index under "negative
collocations".

Hope this helps,

Ton van der Wouden

In response to Jem Clear's reply:

>
>
> > I have puzzled a bit over this notion of being 'less than
> > what would be expected under independence'. Does this just
> > mean that the words in the bigram are independent, or is
> > something further suggested? I'd be interested if anyone else
> > has some thoughts on that particular issue...
>
> I'm no statistician (failed my advanced maths as a schoolboy!), but I
> have also thought about negative MI scores (i.e. 'less than what would
> be expected under independence') and I find it quite intuitively
> acceptable that if you look at the co-occurrence of any two words in a
> corpus you might observe one of the following three phenomena:
>
> a) the two words occur more frequently together than under
> independence (MI > 0): so the words have a tendency to "go together"
> -- this *is* "collocation" as I see it.
>
> b) the words occur together exactly the same number of times as we
> would expect under the independence assumption (MI == 0): so what?
> These words simply occur together because of their relative
> frequencies in the corpus. No other factors influencing their
> co-occurrence.
>
> OR
>
> c) the words occur less frequently together than under
> independence (MI < 0): in which case these words have a tendency *not*
> to occur together -- they "avoid" each other. Or rather
> speakers/writers avoid putting them together. This is like
> "anti-collocation" and it's a phenomenon very little studied as far as
> I'm aware.
>
> I used this negative<-->positive MI scale as a key to doing some
> automatic word sense discrimination experiments some years ago. The
> point was that if you calculated a set of **all** the words that
> co-occurred with some chosen keyword and stored the MI value (be it
> positive or negative), then one could use those MI values to weight
> particular contexts in which the keyword appears (concordance lines!)
> as "close to" (MI > 0) or "far away" (MI < 0) from a given sense of
> the keyword. That's not very clear, is it? It's not important
> anyway. I just mean that negative MI is a) certainly possible, b)
> consonant with our intuitive notions about how words go together in
> text and c) quite useful!
>
> Jem Clear
>
> 29 School Road, Moseley, Birmingham, B13 9TF, UK
> Tel & Fax: +44 (0)121 689 3637
> Email:     jem at jemclear.co.uk
> Web:       www.englishexpert.co.uk
>
>
>
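
The three cases a), b) and c) above, and the idea of weighting
concordance lines by stored MI values, can be sketched in Python as
follows. This is only an illustration under stated assumptions, not the
procedure used in the experiments mentioned: the function names, the
base-2 logarithm, the use of one corpus size N for both word and pair
probabilities, and the simple summing of MI values per line are all
hypothetical choices made for the example.

```python
import math


def pmi(pair_count, count_x, count_y, total):
    """Pointwise MI (log base 2) of a word pair from raw counts.

    Assumption: `total` (corpus size N) is used for both the single-word
    and the pair probabilities, so the expected pair count under
    independence is count_x * count_y / total.
    """
    if count_x <= 0 or count_y <= 0 or total <= 0:
        raise ValueError("counts and corpus size must be positive")
    if pair_count == 0:
        # The pair never occurs: log(0), treated here as minus infinity.
        return float("-inf")
    expected = count_x * count_y / total
    return math.log2(pair_count / expected)


def classify(score, tol=1e-9):
    """Map an MI score onto the three cases a), b) and c)."""
    if score > tol:
        return "collocation"        # a) more often than expected
    if score < -tol:
        return "anti-collocation"   # c) less often than expected
    return "independent"            # b) exactly as expected


def score_line(context_words, mi_by_word):
    """Hypothetical weighting of one concordance line: the sum of the
    stored MI values of its context words (unknown words count as 0)."""
    return sum(mi_by_word.get(word, 0.0) for word in context_words)


if __name__ == "__main__":
    # Two words with 100 occurrences each in a corpus of 1000 tokens
    # are expected to co-occur 10 times under independence.
    for observed in (20, 10, 5):
        score = pmi(observed, 100, 100, 1000)
        print(observed, score, classify(score))
```

With these made-up counts, a pair observed 20 times scores above zero,
a pair observed 10 times scores exactly zero, and a pair observed 5
times scores below zero, which is the "anti-collocation" case.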

--
----------------------------------------------------------------------
Ton van der Wouden

1. VNC-Project "Partikelgebruik in Nederland en Vlaanderen"
2. Syntactische Annotatie Corpus Gesproken Nederlands
(http://lands.let.kun.nl/cgn/)

p/a ATW Leiden
Postbus 9515
2300 RA Leiden
tel. 071 5171089 (home) 071 5277983/030 2536172 (work)
     071 5272615 (fax) 06 10836731 (m)
email vdwouden at let.rug.nl
http://www.let.rug.nl/~vdwouden

homepage Algemene Vereniging voor Taalwetenschap/Linguistic Society of
the Netherlands: http://www.let.rug.nl/orgs/avt

----------------------------------------------------------------------