[Corpora-List] question about Wordsmith tools (log-likelihood)

Stefan Evert stefan.evert at uos.de
Fri Sep 22 21:31:51 UTC 2006


Dear Luciana,

calculating 'd' precisely for span-based collocations is a tricky  
problem indeed, especially if you want to do it in a mathematically  
sound way.  I've tried to work this out in my PhD thesis (which you  
can get from my homepage purl.org/stefan.evert - look under  
"Publications"), but the description has become fairly technical and  
complicated.

A reasonably good approximation is achieved by the following  
procedure, which calculates the four entries of the contingency table  
from the number of cooccurrences (a), the marginals (r1 = first row  
and c1 = first column), and the sample size (n).

a = number of cooccurrences of w1 and w2 within the chosen span size
c1 = first column marginal = unigram frequency of w2

The next two values may be different from what you would do intuitively:

r1 = first row marginal = number of "slots" where w2 could cooccur
with w1 = span size (left + right positions) * unigram frequency of w1
n = sample size = total number of tokens in the corpus (yes, you were  
right, d will be close to the total number of tokens)

I think that the calculation of r1 merits some further explanation.  
In your case, where a 1:1 span is used, there are two positions  
around each instance of w1 where an instance of w2 could in principle  
cooccur with it, so the total number of "slots" is 2 * f(w1).  If you  
increase the span size, the number of slots increases  
correspondingly, so for a 3:3 span, it would be 6 * f(w1); for a
one-sided 0:5 span, it would be 5 * f(w1).
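
The slot counts above can be sketched in a few lines of Python. The function name slot_count and the "left:right" string parsing are assumptions made for illustration; they are not part of WordSmith Tools or any other package, and f(w1) = 100 is an invented frequency.

```python
def slot_count(span: str, f_w1: int) -> int:
    """Number of positions around all instances of w1 where w2 could occur.

    `span` uses the "left:right" notation from the text, e.g. "1:1" or "0:5".
    """
    left, right = (int(part) for part in span.split(":"))
    return (left + right) * f_w1


# f(w1) = 100 is an invented frequency, used only to show the arithmetic.
for span in ("1:1", "3:3", "0:5"):
    print(span, slot_count(span, 100))
```

With f(w1) = 100 this gives 200, 600 and 500 slots, matching the factors 2, 6 and 5 above.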

Once you've got all this information, it's straightforward to  
calculate the contingency table:

a = a (as defined above)
b = r1 - a
c = c1 - a
d = n - r1 - c1 + a
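
As a minimal sketch, the four equations can be written out in Python as follows. The names ContingencyTable and contingency_table are hypothetical, and the frequencies in the example call are invented purely to show the arithmetic.

```python
from typing import NamedTuple


class ContingencyTable(NamedTuple):
    a: int  # cooccurrences of w1 and w2 within the span
    b: int  # slots around w1 not filled by w2
    c: int  # instances of w2 outside the span of w1
    d: int  # everything else


def contingency_table(cooc: int, f_w1: int, f_w2: int,
                      span_size: int, n: int) -> ContingencyTable:
    """Approximate contingency table for a span-based collocation.

    span_size is the total number of positions in the span
    (e.g. 2 for a 1:1 span, 6 for a 3:3 span).
    """
    r1 = span_size * f_w1   # first row marginal: number of slots
    c1 = f_w2               # first column marginal: unigram frequency of w2
    a = cooc
    b = r1 - a
    c = c1 - a
    d = n - r1 - c1 + a
    return ContingencyTable(a, b, c, d)


# Invented example: 1:1 span in a corpus of one million tokens.
table = contingency_table(cooc=40, f_w1=500, f_w2=800,
                          span_size=2, n=1_000_000)
print(table)  # ContingencyTable(a=40, b=960, c=760, d=998240)

# The four cells always add up to the sample size n.
assert sum(table) == 1_000_000
```

In this made-up case d = 998240, which is indeed close to the total number of tokens.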

Hope this helps to clarify things a little,
Stefan

PS: If you look closely at these equations, you'll notice that  
changing the span size will also change d, but only by a  
comparatively small amount.  r1 and b, on the other hand, are much  
more sensitive to span size.
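
A rough numerical check of this, again with invented frequencies. For simplicity the sketch holds the cooccurrence count a fixed across spans, which is an assumption; in a real corpus a would also change with the span size.

```python
# Invented values: corpus size, unigram frequencies, cooccurrence count.
n, f_w1, f_w2, a = 1_000_000, 500, 800, 40


def b_and_d(span_size: int) -> tuple:
    """Return the cells b and d for a given total span size."""
    r1 = span_size * f_w1
    return r1 - a, n - r1 - f_w2 + a


b_small, d_small = b_and_d(2)  # 1:1 span
b_large, d_large = b_and_d(6)  # 3:3 span

print(f"b: {b_small} -> {b_large} ({b_large / b_small:.2f}x)")
print(f"d: {d_small} -> {d_large} ({d_large / d_small:.4f}x)")
```

Going from a 1:1 to a 3:3 span, b roughly triples (960 to 2960), while d shrinks by only about 0.2% (998240 to 996240).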


On 20 Sep 2006, at 22:50, Luciana Diniz wrote:

> I'm trying to make sense of the log likelihood formula (in the Wordsmith
> Tools manual), and I'm not sure what "d" means in:
>
> "d := frequency of pairs involving neither w1 nor w2"
>
> Does it mean the frequency of all possible collocates (with span
> 1:1) minus the frequency of the word 1 (isolated frequency) minus the
> frequency of word 2 (isolated frequency)?
> If this is the case, would "d" be very close to the total number of
> words in the corpus?
>
> Also, if this is the case, what if I choose a different span? Would this
> change the value of "d"?
>
> I'm very confused and I'd really appreciate it if somebody could help me
> :)
>
> Thank you!
> Luciana.
>


