[Corpora-List] Measuring relative collocational strength

Justin Washtell lec3jrw at leeds.ac.uk
Wed Oct 13 14:36:33 UTC 2010


Hi Alon,

MI should work fine in that setting, provided the frequencies of the terms and their collocates aren't so low as to make the results unreliable. However, if "spud" and "potato" are being studied in the same corpus, the marginal probability of each collocate term is the same for both terms, so you do not need to use MI: conditional probability is probably adequate.
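
To illustrate why the marginal probability drops out, here is a minimal Python sketch. All counts are invented for illustration (they are not from any real corpus), and the function names are my own. It assumes MI is pointwise mutual information in base 2 and that co-occurrence is counted once per node occurrence.

```python
import math

def conditional_prob(cooc, node_freq):
    """P(collocate | node): share of node occurrences that have the collocate nearby."""
    return cooc / node_freq

def pmi(cooc, node_freq, colloc_freq, corpus_size):
    """Pointwise mutual information (base 2) between a node and a collocate."""
    p_joint = cooc / corpus_size
    p_node = node_freq / corpus_size
    p_colloc = colloc_freq / corpus_size
    return math.log2(p_joint / (p_node * p_colloc))

# Hypothetical counts, assumed for illustration only.
CORPUS_SIZE = 1_000_000
MASHED_FREQ = 500                      # frequency of the collocate "mashed"
POTATO_FREQ, POTATO_COOC = 2000, 80    # "potato", and "potato" with "mashed"
SPUD_FREQ, SPUD_COOC = 100, 2          # "spud", and "spud" with "mashed"

# Difference in MI between the two node words for the same collocate.
pmi_diff = (pmi(POTATO_COOC, POTATO_FREQ, MASHED_FREQ, CORPUS_SIZE)
            - pmi(SPUD_COOC, SPUD_FREQ, MASHED_FREQ, CORPUS_SIZE))

# Log ratio of the two conditional probabilities.
cp_log_ratio = math.log2(conditional_prob(POTATO_COOC, POTATO_FREQ)
                         / conditional_prob(SPUD_COOC, SPUD_FREQ))

print(pmi_diff, cp_log_ratio)
```

With these made-up numbers both values come out at roughly 1.0: the collocate's own frequency cancels when the two MI scores are compared, so the comparison carries no more information than the ratio of the conditional probabilities.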

You still have the issue of how to compare these values. I would expect the best choice to be a log-likelihood test, computed for each collocate term, on the difference between its conditional probabilities given each of the two terms of interest. That will give you a measure of significance which takes the marginal frequencies of the collocate terms into account, and will therefore identify any "suitably surprising" differences, in either direction (if you supply a threshold).
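
One way this could be sketched in Python is below. It is my reading of the suggestion, not a definitive method: it assumes the log-likelihood in question is the G2 statistic on a 2x2 table (the two terms by "this collocate" versus "anything else"), and the counts and function names are invented for illustration.

```python
import math

def log_likelihood(k1, n1, k2, n2):
    """G2 for a 2x2 table.

    k1, k2: co-occurrences of the collocate with term 1 and term 2.
    n1, n2: total occurrences of term 1 and term 2.
    """
    observed = [[k1, n1 - k1], [k2, n2 - k2]]
    row_totals = [n1, n2]
    col_totals = [k1 + k2, (n1 - k1) + (n2 - k2)]
    total = n1 + n2
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # an empty cell contributes nothing to the sum
                expected = row_totals[i] * col_totals[j] / total
                g2 += o * math.log(o / expected)
    return 2.0 * g2

def compare_collocate(k1, n1, k2, n2, threshold=3.84):
    """Return (G2, direction); direction is None when below the threshold."""
    g2 = log_likelihood(k1, n1, k2, n2)
    if g2 < threshold:
        return g2, None
    direction = "term1" if k1 / n1 > k2 / n2 else "term2"
    return g2, direction

# Hypothetical counts: "mashed" occurs with 80 of 2000 "potato" tokens
# and with 2 of 100 "spud" tokens.
print(compare_collocate(80, 2000, 2, 100))
```

The default threshold of 3.84 is the chi-square critical value for p < 0.05 with one degree of freedom. When many collocates are tested at once, a stricter threshold would probably be wise.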

I'm by no means an expert on these measures, so I would get a second opinion first, but this seems sensible to me. Unfortunately I cannot recommend the best software to use for this. I expect there are quite a few options.

Justin Washtell
University of Leeds

________________________________________
From: corpora-bounces at uib.no [corpora-bounces at uib.no] On Behalf Of Alon Lischinsky [alon.lischinsky at kultmed.umu.se]
Sent: 13 October 2010 15:01
To: Corpora Mailing List
Subject: [Corpora-List] Measuring relative collocational strength

Hi.

I am looking for help with a kind of statistical measure that has
probably been described in the literature, but which I don't know what
to call. I should point out that I'm relatively new to corpus studies,
having a background in qualitative discourse studies, and am still
coming to terms with some of the technical lexis.

Simply put, I want to find out, given two terms that are seemingly
synonymous but different in absolute frequency (say, "potato" and
"spud"), which (lexical) terms have statistically significant
differences in their collocation with either. I suppose I could simply
look at the full list of collocates for each term ordered by t-score
or MI and spot differences, but since one of the terms is much rarer
and MI scores are affected by absolute frequency, I guess this might
lead to quite a few artifacts.

I don't know of any piece of software that can do that, so I would
appreciate any pointers, or even suggestions as to how to go about
doing it in R or any other statistical software (my programming skills
aren't great, but I trust I could manage with a little guidance).

Best,

Alon Lischinsky

_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora
