[Corpora-List] Measuring relative collocational strength

Thu Oct 14 13:13:28 UTC 2010

Alon,

I also found log-likelihood ratios to be useful. I would use second-order
measures: I would measure the LL between word A and each of its
collocates, creating a vector containing these values; I'd do the same with
word B; then compare the two vectors, simply with cosine or another known
vector similarity function. See

Scott McDonald. 2000. Environmental determinants of lexical processing
effort. Ph.D. thesis, University of
Edinburgh.
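
A minimal sketch of the second-order comparison described above, under
stated assumptions: the corpus is already tokenized into a Python list,
collocates are the tokens within a fixed window of the node, and LL is
Dunning's G2 computed on a 2x2 table (inside vs. outside the node's
windows, collocate vs. any other token). The names log_likelihood,
ll_vector and cosine, the window size, and the decision to keep only
collocates that occur more often than expected are illustrative choices,
not something taken from the references.

```python
import math
from collections import Counter


def log_likelihood(k11, k12, k21, k22):
    """Dunning's G2 statistic for a 2x2 contingency table of counts."""
    def xlogx(*counts):
        return sum(k * math.log(k) for k in counts if k > 0)

    n = k11 + k12 + k21 + k22
    g2 = 2.0 * (xlogx(k11, k12, k21, k22)
                - xlogx(k11 + k12, k21 + k22)   # row totals
                - xlogx(k11 + k21, k12 + k22)   # column totals
                + xlogx(n))
    return max(0.0, g2)  # guard against tiny negative rounding errors


def ll_vector(tokens, node, window=4):
    """Map each collocate of `node` to its LL association score."""
    n = len(tokens)
    freq = Counter(tokens)
    # Positions within `window` tokens of any occurrence of the node.
    # Using a set counts each position once, even if windows overlap.
    near = set()
    for i, tok in enumerate(tokens):
        if tok == node:
            near.update(range(max(0, i - window), min(n, i + window + 1)))
    near = {j for j in near if tokens[j] != node}
    span = len(near)
    co = Counter(tokens[j] for j in near)

    vec = {}
    for c, k11 in co.items():
        # Assumption: keep only attraction (observed > expected), because
        # G2 is unsigned and would also score avoided words highly.
        if k11 * n <= span * freq[c]:
            continue
        k12 = span - k11        # other tokens near the node
        k21 = freq[c] - k11     # the collocate away from the node
        k22 = n - span - k21    # everything else
        vec[c] = log_likelihood(k11, k12, k21, k22)
    return vec


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[k] for k, w in u.items() if k in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```

With the corpus in a list called tokens, cosine(ll_vector(tokens, "A"),
ll_vector(tokens, "B")) gives one similarity score per word pair, so the
score for (A,B) can be compared with the score for (A,C). Low-frequency
collocates make the vectors noisy, so a minimum co-occurrence count may
be worth adding in practice.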

There are a few surveys on semantic distance measures; here's one you might
find helpful:

Julie Weeds, David Weir, and Diana McCarthy. 2004.
Characterising measures of lexical distributional similarity.
In Proceedings of the 20th International
Conference on Computational Linguistics (COLING),
pages 1015–1021, Geneva, Switzerland.

There are newer measures, too:

Saif Mohammad and Graeme Hirst. 2006. Distributional measures of
concept-distance: A task-oriented evaluation. In Proc. EMNLP, Sydney,
Australia.

Yuval Marton, Saif Mohammad, and Philip Resnik. 2009b. Estimating semantic
distance using soft semantic constraints in knowledge-source / corpus hybrid
models. In Proc. EMNLP, Singapore.

and others.

I am not sure what you expect to get from statistical tests here. It is
hardly likely that two words are completely synonymous in all contexts, so
you are probably left with a scale or degrees of similarity. These are
useful to compare pairs: is the word pair (A,B) closer in distribution, and
hence presumably in meaning, than (A,C)?

Last note: I would suspect that a 1.5M token corpus might give you results
that are so-so. If you can scale up to 100M tokens, or even an order of
magnitude beyond that, I think you'd do better.

HTH,

-Yuval

On Thu, Oct 14, 2010 at 8:35 AM, Alon Lischinsky <
alon.lischinsky at kultmed.umu.se> wrote:

> Hi again.
>
> @Justin:
> > You still have the issue of how to compare these values. I would expect
> > that the best choice would be to take the Log-Likelihoods of the conditional
> > probabilities of each collocate term, between the two terms of interest.
> > That will give you a measure of significance which will take the marginal
> > frequencies of the collocate terms into account, and will therefore identify
> > any "suitably surprising" differences, in either direction (if you supply
> > a threshold).
>
> I am not really sure I'm following you. I'm familiar with
> log-likelihood as a test for measuring the frequency of node-collocate
> pairs relative to a corpus of which these pairs are a subset, but what
> I'm trying to measure is the significance of the difference between
> the frequency of node A-collocate pairs and the frequency of node
> B-collocate pairs (where neither is a subset of the other). This might
> be my innumeracy speaking, of course, but I wouldn't know how to adapt
> the LL formula I know for this purpose.
>
> @Adam:
> > the problem is - the differences are extremely likely to be statistically
> > significant but that does not mean they are linguistically interesting
>
> I'm aware of that (although I am not sure the statistical implications
> have entirely sunk in). However, in the case in point I'm comparing
> two terms that cannot be expected to differ for syntactic or semantic
> reasons, and are both relatively frequent (4384 vs 8991 in a
> 1'500'000-token corpus). I simply want to test whether they are really
> in free variation (to borrow a term from phonetics) or whether they
> tend to vary according to the co-text. Finding whether there are
> statistically significant differences in their collocation patterns
> seemed a good way to start. I have observed a few differences that
> would seem to be linguistically interesting, and I'm trying to
> determine whether they are statistically significant as well, or
> merely explainable as an artifact of my data.
>
> > So you can't get an objective answer to the question 'is the difference
> > noteworthy' (at least not until we have a far better theory of corpora) but
> > there are some suggestions of the maths to support your analysis in Simple
> > Maths for Keywords (Proc. Corpus Linguistics, Liverpool 2009)
>
> Thanks for the pointer.  I've read the paper, but (if the version at
> http://www.kilgarriff.co.uk/Publications/2009-K-CLLiverpool-SimpleMaths.doc
> is the final one), it doesn't really cover my case, as I'm concerned
> with the differences between the collocates of two nodes within the
> same corpus.
>
> Alon
>
> _______________________________________________
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
>