[Corpora-List] Dice coefficient

Markus Saers masaers at gmail.com
Wed Apr 19 07:50:07 UTC 2006


Hello,

My name is Markus Saers, and I am currently implementing an alignment tool
as part of a course in Java for NLP. When trying to implement the Dice
coefficient, I ran into some problems that I was hoping someone could help
me with.

The only definition of the Dice coefficient that I have seen looks like
this:

Dice = 2 * p(ws, wt) / ( p(ws) + p(wt) )

Where p(ws, wt) is the probability of the source word co-occurring with the
target word, p(ws) is the probability of the source word and p(wt) is the
probability of the target word.
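
As a minimal sketch, the definition above could be written in Java roughly
as follows. The class and method names are hypothetical, and returning 0
when both marginal probabilities are 0 is an assumption made only to avoid
a division by zero; it is not part of the definition.

```java
// Hypothetical helper; names are illustrative and not taken from any library.
final class DiceSketch {
    private DiceSketch() {}

    /**
     * Dice = 2 * p(ws, wt) / ( p(ws) + p(wt) )
     *
     * @param pJoint  p(ws, wt), probability of source and target word co-occurring
     * @param pSource p(ws), probability of the source word
     * @param pTarget p(wt), probability of the target word
     */
    static double dice(double pJoint, double pSource, double pTarget) {
        double denominator = pSource + pTarget;
        if (denominator == 0.0) {
            // Assumption: treat the undefined 0/0 case as "no association".
            return 0.0;
        }
        return 2.0 * pJoint / denominator;
    }

    public static void main(String[] args) {
        // Made-up probabilities, for illustration only.
        System.out.println(dice(0.25, 0.5, 0.5));
    }
}
```

How the three probabilities are obtained is left open here; the sketch only
evaluates the formula once they are given.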

Although the formula is stated in terms of probabilities, some information
I found online suggests that frequency counts are used instead. That is
problematic in word alignment, since it would presuppose that Ns = Nt (where
Ns is the number of source words and Nt is the number of target words).
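
The concern can be illustrated with a small Java sketch that compares a
count-based Dice value with a probability-based one. All names are
hypothetical, the counts are made up, and the normalizer used for the joint
probability (nJoint) is purely an assumption, since how p(ws, wt) should be
estimated is exactly what is unclear. The sketch also assumes non-zero
counts and normalizers.

```java
// Hypothetical comparison of count-based and probability-based Dice values.
final class DiceCountsSketch {
    private DiceCountsSketch() {}

    /** Dice computed directly on frequency counts: 2 * c(ws, wt) / (c(ws) + c(wt)). */
    static double diceFromCounts(long cJoint, long cSource, long cTarget) {
        return 2.0 * cJoint / (cSource + cTarget);
    }

    /**
     * Dice computed on relative frequencies. nSource and nTarget correspond to
     * Ns and Nt; nJoint is an assumed normalizer for the co-occurrence count.
     */
    static double diceFromProbabilities(long cJoint, long cSource, long cTarget,
                                        double nJoint, double nSource, double nTarget) {
        double pJoint = cJoint / nJoint;
        double pSource = cSource / nSource;
        double pTarget = cTarget / nTarget;
        return 2.0 * pJoint / (pSource + pTarget);
    }

    public static void main(String[] args) {
        // Made-up counts, for illustration only.
        long cJoint = 10, cSource = 20, cTarget = 20;

        System.out.println("counts:           "
                + diceFromCounts(cJoint, cSource, cTarget));
        // Equal normalizers: the normalizer cancels, so this matches the count form.
        System.out.println("probs, Ns = Nt:   "
                + diceFromProbabilities(cJoint, cSource, cTarget, 1000, 1000, 1000));
        // Unequal normalizers: the two forms no longer agree.
        System.out.println("probs, Ns != Nt:  "
                + diceFromProbabilities(cJoint, cSource, cTarget, 1000, 1000, 2000));
    }
}
```

When all three normalizers are equal they cancel and the two forms coincide;
once Ns and Nt differ, the count-based value is no longer the same quantity
as the probability-based one.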

The second problem arises when probabilities ARE used. p(ws) and p(wt) are
easy to estimate, but how is p(ws, wt) estimated?

Best regards
Markus Saers
PhD student, Uppsala University