[Corpora-List] question about Wordsmith tools (log-likelihood)
Stefan Evert
stefan.evert at uos.de
Fri Sep 22 21:31:51 UTC 2006
Dear Luciana,
calculating 'd' precisely for span-based collocations is a tricky
problem indeed, especially if you want to do it in a mathematically
sound way. I've tried to work this out in my PhD thesis (which you
can get from my homepage purl.org/stefan.evert - look under
"Publications"), but the description has become fairly technical and
complicated.
A reasonably good approximation is achieved by the following
procedure, which calculates the four entries of the contingency table
from the number of cooccurrences (a), the marginals (r1 = first row
and c1 = first column), and the sample size (n).
a = number of cooccurrences of w1 and w2 within the chosen span size
c1 = first column marginal = unigram frequency of w2
The next two values may be different from what you would do intuitively:
r1 = first row marginal = number of "slots" where w2 could cooccur
with w1 = span size (total number of positions, left plus right) *
unigram frequency of w1
n = sample size = total number of tokens in the corpus (yes, you were
right, d will be close to the total number of tokens)
I think that the calculation of r1 merits some further explanation.
In your case, where a 1:1 span is used, there are two positions
around each instance of w1 where an instance of w2 could in principle
cooccur with it, so the total number of "slots" is 2 * f(w1). If you
increase the span size, the number of slots increases
correspondingly, so for a 3:3 span, it would be 6 * f(w1); for a
one-sided 0:5 span, it would be 5 * f(w1).
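The slot count described above can be sketched in a few lines of Python.
This is only an illustration: the function name and the frequency of 100
are made up for the example and do not come from WordSmith Tools.

```python
def slot_count(left: int, right: int, f_w1: int) -> int:
    """Number of positions around all instances of w1 where w2 could occur.

    left, right: span size on each side of w1 (e.g. 1 and 1 for a 1:1 span)
    f_w1: unigram frequency of w1 (hypothetical value in the calls below)
    """
    return (left + right) * f_w1


# Hypothetical frequency f(w1) = 100
print(slot_count(1, 1, 100))  # 1:1 span -> 2 * f(w1)
print(slot_count(3, 3, 100))  # 3:3 span -> 6 * f(w1)
print(slot_count(0, 5, 100))  # 0:5 span -> 5 * f(w1)
```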
Once you've got all this information, it's straightforward to
calculate the contingency table:
a = a (as defined above)
b = r1 - a
c = c1 - a
d = n - r1 - c1 + a
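As a minimal sketch, the four equations above can be written as a small
Python function. The function name, argument names and the numbers in the
example call are hypothetical; only the arithmetic follows the procedure
described in this message.

```python
def contingency_table(a, f_w1, f_w2, n, span_positions):
    """Approximate contingency table for a span-based collocation.

    a: number of cooccurrences of w1 and w2 within the span
    f_w1, f_w2: unigram frequencies of w1 and w2
    n: total number of tokens in the corpus
    span_positions: total number of span positions (2 for a 1:1 span)
    """
    r1 = span_positions * f_w1  # first row marginal: number of "slots"
    c1 = f_w2                   # first column marginal
    b = r1 - a
    c = c1 - a
    d = n - r1 - c1 + a
    return a, b, c, d


# Hypothetical counts: 30 cooccurrences, f(w1) = 100, f(w2) = 500,
# a corpus of 1,000,000 tokens, and a 1:1 span (2 positions).
a, b, c, d = contingency_table(30, 100, 500, 1_000_000, 2)
print(a, b, c, d)
print(a + b + c + d)  # the four cells always add up to n
```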
Hope this helps to clarify things a little,
Stefan
PS: If you look closely at these equations, you'll notice that
changing the span size will also change d, but only by a
comparatively small amount. r1 and b, on the other hand, are much
more sensitive to span size.
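To see the effect of span size on b and d, here is a rough sketch with
made-up counts. It assumes, purely for illustration, that the cooccurrence
count a stays the same for every span size; in a real corpus a would
normally grow as the span gets wider.

```python
def b_and_d(a, f_w1, f_w2, n, span_positions):
    """Return cells b and d of the approximate contingency table."""
    r1 = span_positions * f_w1  # number of "slots" around w1
    c1 = f_w2
    return r1 - a, n - r1 - c1 + a


# Hypothetical counts; a is held fixed only to isolate the effect of span size.
for span_positions in (2, 6, 10):  # e.g. 1:1, 3:3 and 5:5 spans
    b, d = b_and_d(a=30, f_w1=100, f_w2=500, n=1_000_000,
                   span_positions=span_positions)
    print(span_positions, b, d)
```

With these numbers b more than triples between the 1:1 and 3:3 spans,
while d changes by a few hundred out of roughly a million.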
On 20 Sep 2006, at 22:50, Luciana Diniz wrote:
> I'm trying to make sense of the log likelihood formula (in the
> Wordsmith Tools manual), and I'm not sure what "d" means in:
>
> "d := frequency of pairs involving neither w1 nor w2"
>
> Does it mean the frequency of all possible collocates (with span
> 1:1) minus the frequency of word 1 (isolated frequency) minus the
> frequency of word 2 (isolated frequency)?
> If this is the case, would "d" be very close to the total number of
> words in the corpus?
>
> Also, if this is the case, what if I choose a different span? Would
> this change the value of "d"?
>
> I'm very confused and I'd really appreciate it if somebody could
> help me :)
>
> Thank you!
> Luciana.
>