[Corpora-List] Measuring relative collocational strength

Thu Oct 14 12:35:41 UTC 2010

Hi again.

@Justin:
> You still have the issue of how to compare these values. I would expect
> that the best choice would be to take the Log-Likelihoods of the conditional
> probabilities of each collocate term, between the two terms of interest.
> That will give you a measure of significance which will take the marginal
> frequencies of the collocate terms into account, and will therefore identify
> any "suitably surprising" differences, in either direction (if you supply
> a threshold).

I am not really sure I'm following you. I'm familiar with
log-likelihood as a test for measuring the frequency of node-collocate
pairs relative to a corpus of which these pairs are a subset, but what
I'm trying to measure is the significance of the difference between
the frequency of node A-collocate pairs and the frequency of node
B-collocate pairs (where neither is a subset of the other). This might
be my innumeracy speaking, of course, but I wouldn't know how to adapt
the LL formula I know for this purpose.
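
For reference, the LL I know is the usual 2x2 contingency-table
statistic (G2) for a single node-collocate pair measured against the
rest of the corpus. A minimal Python sketch of it follows. The function
name and the collocate counts are made up for illustration; only the
node frequency and the corpus size come from my data.

```python
import math

def log_likelihood(o11, o12, o21, o22):
    """G2 statistic for a 2x2 table of observed counts.

    o11: node with collocate      o12: node without collocate
    o21: collocate without node   o22: neither
    """
    n = o11 + o12 + o21 + o22
    rows = (o11 + o12, o21 + o22)
    cols = (o11 + o21, o12 + o22)
    observed = ((o11, o12), (o21, o22))
    total = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            e = rows[i] * cols[j] / n  # expected count under independence
            if o > 0:                  # a zero cell contributes nothing
                total += o * math.log(o / e)
    return 2 * total

# Hypothetical counts: the node occurs 4384 times in 1,500,000 tokens;
# the collocate frequency (1000) and the co-occurrence count (50) are invented.
g2 = log_likelihood(50, 4384 - 50, 1000 - 50, 1500000 - 4384 - (1000 - 50))
print(g2)
```

Here the node-collocate pairs are a subset of the corpus they are
compared against, which is precisely what does not hold when comparing
node A with node B.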

@Adam:
> the problem is - the differences are extremely likely to be statistically
> significant but that does not mean they are linguistically interesting

I'm aware of that (although I am not sure the statistical implications
have entirely sunk in). However, in the case in point I'm comparing
two terms that cannot be expected to differ for syntactic or semantic
reasons, and are both relatively frequent (4384 vs 8991 occurrences in
a 1,500,000-token corpus). I simply want to test whether they are
really in free variation (to borrow a term from phonetics) or whether
they tend to vary according to the co-text. Finding out whether there
are statistically significant differences in their collocation
patterns seemed a good way to start. I have observed a few differences that
would seem to be linguistically interesting, and I'm trying to
determine whether they are statistically significant as well, or
merely explainable as an artifact of my data.
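
To make the comparison concrete: for each collocate I have one count
with node A and one with node B, which, together with the two node
totals, gives a 2x2 table in which neither row is a subset of the
other. The Python sketch below only lays out that table and the
per-node rates. The co-occurrence counts are invented and the function
names are mine; which significance test is appropriate for such a table
is exactly what I am unsure about.

```python
def comparison_table(with_a, with_b, total_a=4384, total_b=8991):
    """Rows: node A, node B. Columns: with the collocate, without it."""
    return [
        [with_a, total_a - with_a],
        [with_b, total_b - with_b],
    ]

def collocate_rates(table):
    """Share of each node's occurrences that have the collocate."""
    return [row[0] / (row[0] + row[1]) for row in table]

# Hypothetical co-occurrence counts for one collocate.
table = comparison_table(with_a=120, with_b=90)
rate_a, rate_b = collocate_rates(table)
print(table)
print(rate_a, rate_b)
```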

> So you can't get an objective answer to the question 'is the difference
> noteworthy' (at least not until we have a far better theory of corpora) but
> there are some suggestions of the maths to support your analysis in Simple
> Maths for Keywords (Proc. Corpus Linguistics, Liverpool 2009)

Thanks for the pointer. I've read the paper but, if the version at
http://www.kilgarriff.co.uk/Publications/2009-K-CLLiverpool-SimpleMaths.doc
is the final one, it doesn't really cover my case, as I'm concerned
with the differences between the collocates of two nodes within the
same corpus.

Alon
