Alon,<br><br>I have also found log-likelihood ratios useful. I would use second-order measures: compute the LL between word A and each of its collocates to build a vector of these values, do the same for word B, and then compare the two vectors with cosine or another standard vector similarity function. See <br>
<br>Scott McDonald. 2000. Environmental determinants of lexical processing effort. Ph.D. thesis, University of<br>Edinburgh.<br><br>There are a few surveys of semantic distance measures; here is one you might find helpful:<br>
<br>Julie Weeds, David Weir, and Diana McCarthy. 2004.<br>Characterising measures of lexical distributional similarity.<br>In Proceedings of the 20th International<br>Conference on Computational Linguistics (COLING),<br>pages 1015–1021, Geneva, Switzerland.<br>
<br>There are newer measures, too:<br><br>Saif Mohammad and Graeme Hirst. 2006. Distributional measures of concept-distance: A task-oriented evaluation. In Proc. EMNLP, Sydney, Australia.<br><br>Yuval Marton, Saif Mohammad, and Philip Resnik. 2009b. Estimating semantic distance using soft semantic constraints in knowledge-source / corpus hybrid models. In Proc. EMNLP, Singapore.<br>
<br>and others.<br><br>I am not sure what you expect to get from statistical tests here. It is hardly likely that two words are completely synonymous in all contexts, so you are probably left with a scale or degrees of similarity. These are useful to compare pairs: is the word pair (A,B) closer in distribution, and hence presumably in meaning, than (A,C)?<br>
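<br>Roughly, the second-order comparison described above could be sketched in Python as follows. This is only an illustration under my own assumptions: a tokenized corpus, a symmetric co-occurrence window, and Dunning-style G2 as the LL score. The function names (cooccurrence_counts, ll_vector, cosine), the window size, and the toy sentence are hypothetical and not taken from any of the cited papers.<br>

```python
import math
from collections import Counter


def g2(k11, k12, k21, k22):
    """Log-likelihood ratio (G2) for a 2x2 contingency table."""
    total = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)

    def term(observed, row, col):
        # A zero cell contributes nothing to the sum.
        if observed == 0:
            return 0.0
        expected = row * col / total
        return observed * math.log(observed / expected)

    return 2.0 * (term(k11, rows[0], cols[0]) + term(k12, rows[0], cols[1])
                  + term(k21, rows[1], cols[0]) + term(k22, rows[1], cols[1]))


def cooccurrence_counts(tokens, window=2):
    """Count (node, collocate) pairs within a symmetric window (assumed size)."""
    pair_counts = Counter()
    for i, node in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:
                pair_counts[(node, tokens[j])] += 1
    return pair_counts


def ll_vector(node, pair_counts):
    """Map each observed collocate of `node` to its G2 score."""
    total = sum(pair_counts.values())
    as_node, as_collocate = Counter(), Counter()
    for (a, b), n in pair_counts.items():
        as_node[a] += n
        as_collocate[b] += n
    vector = {}
    for (a, b), k11 in pair_counts.items():
        if a != node:
            continue
        k12 = as_node[node] - k11        # node with other collocates
        k21 = as_collocate[b] - k11      # collocate with other nodes
        k22 = total - k11 - k12 - k21    # everything else
        vector[b] = g2(k11, k12, k21, k22)
    return vector


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(value * v.get(key, 0.0) for key, value in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    # Toy data only; a real corpus would be far larger.
    tokens = ("the big dog barked at the big cat while "
              "the large dog barked at the small cat").split()
    counts = cooccurrence_counts(tokens, window=2)
    big, large, small = (ll_vector(w, counts) for w in ("big", "large", "small"))
    print("sim(big, large):", cosine(big, large))
    print("sim(big, small):", cosine(big, small))
```

<br>With real data you would pass in your own corpus tokens and compare cosine(A, B) against cosine(A, C). Note that G2 is unsigned, so in this sketch a collocate that is unexpectedly rare scores as high as one that is unexpectedly frequent; whether to sign or filter the scores is a design choice left open here.<br>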
<br>Last note: I suspect that a 1.5M-token corpus will give you only so-so results. If you can scale up to 100M tokens, or even an order of magnitude more, I think you'd do better.<br><br><br>HTH,<br><br>-Yuval<br>
<br><br><br><br><div class="gmail_quote">On Thu, Oct 14, 2010 at 8:35 AM, Alon Lischinsky <span dir="ltr">&lt;<a href="mailto:alon.lischinsky@kultmed.umu.se">alon.lischinsky@kultmed.umu.se</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;">
Hi again.<br>
<br>
@Justin:<br>
<div class="im">> You still have the issue of how to compare these values. I would expect<br>
> that the best choice would be to take the Log-Likelihoods of the conditional<br>
> probabilities of each collocate term, between the two terms of interest.<br>
> That will give you a measure of significance which will take the marginal<br>
> frequencies of the collocate terms into account, and will therefore identify<br>
> any "suitably surprising" differences, in either direction (if you supply<br>
> a threshold).<br>
<br>
</div>I am not really sure I'm following you. I'm familiar with<br>
log-likelihood as a test for measuring the frequency of node-collocate<br>
pairs relative to a corpus of which these pairs are a subset, but what<br>
I'm trying to measure is the significance of the difference between<br>
the frequency of node A-collocate pairs and the frequency of node<br>
B-collocate pairs (where neither is a subset of the other). This might<br>
be my innumeracy speaking, of course, but I wouldn't know how to adapt<br>
the LL formula I know for this purpose.<br>
<br>
@Adam:<br>
<div class="im">> the problem is - the differences are extremely likely to be statistically<br>
> significant but that does not mean they are linguistically interesting<br>
<br>
</div>I'm aware of that (although I am not sure the statistical implications<br>
have entirely sunk in). However, in the case in point I'm comparing<br>
two terms that cannot be expected to differ for syntactic or semantic<br>
reasons, and are both relatively frequent (4384 vs 8991 in a<br>
1'500'000-token corpus). I simply want to test whether they are really<br>
in free variation (to borrow a term from phonetics) or whether they<br>
tend to vary according to the co-text. Finding whether there are<br>
statistically significant differences in their collocation patterns<br>
seemed a good way to start. I have observed a few differences that<br>
would seem to be linguistically interesting, and I'm trying to<br>
determine whether they are statistically significant as well, or<br>
merely explainable as an artifact of my data.<br>
<div class="im"><br>
> So you can't get an objective answer to the question 'is the difference<br>
> noteworthy' (at least not until we have a far better theory of corpora) but<br>
> there are some suggestions of the maths to support your analysis in Simple<br>
> Maths for Keywords (Proc. Corpus Linguistics, Liverpool 2009)<br>
<br>
</div>Thanks for the pointer. I've read the paper, but (if the version at<br>
<a href="http://www.kilgarriff.co.uk/Publications/2009-K-CLLiverpool-SimpleMaths.doc" target="_blank">http://www.kilgarriff.co.uk/Publications/2009-K-CLLiverpool-SimpleMaths.doc</a><br>
is the final one), it doesn't really cover my case, as I'm concerned<br>
with the differences between the collocates of two nodes within the<br>
same corpus.<br>
<font color="#888888"><br>
Alon<br>
</font><div><div></div><div class="h5"><br>
</div></div></blockquote></div><br>