Hello,<br><br>My name is Markus Saers, and I am currently implementing an alignment tool as part of a course in Java for NLP. While trying to implement the Dice coefficient, I ran into some problems that I was hoping someone could help me with.
<br><br>The only definition of the Dice coefficient that I have seen looks like this:<br><br>Dice = 2 * p(ws, wt) / ( p(ws) + p(wt) )<br><br>where p(ws, wt) is the probability of the source word co-occurring with the target word, p(ws) is the probability of the source word, and p(wt) is the probability of the target word.
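<br><br>To make the definition concrete, here is a minimal Java sketch of the formula exactly as written above. It assumes the three probabilities have already been estimated somehow. The class name DiceSketch, the method name, and the parameter names are placeholders of my own, and returning 0 when both marginals are zero is an arbitrary choice rather than part of the definition.

```java
public final class DiceSketch {

    private DiceSketch() {
        // Utility class, not meant to be instantiated.
    }

    /**
     * Dice = 2 * p(ws, wt) / (p(ws) + p(wt)).
     *
     * @param pJoint  p(ws, wt), assumed to be estimated elsewhere
     * @param pSource p(ws)
     * @param pTarget p(wt)
     */
    static double dice(double pJoint, double pSource, double pTarget) {
        double denominator = pSource + pTarget;
        if (denominator == 0.0) {
            // Assumption: score an unseen word pair as 0 instead of dividing by zero.
            return 0.0;
        }
        return 2.0 * pJoint / denominator;
    }

    public static void main(String[] args) {
        // Illustrative values only, not taken from any corpus.
        System.out.println(dice(0.25, 0.5, 0.5));
    }
}
```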
<br><br>Although the definition is stated in terms of probabilities, some information that I found online seems to suggest that frequency counts are used instead. That is problematic in word alignment, since it would presuppose that Ns = Nt (where Ns is the number of source words and Nt is the number of target words).
<br><br>The second problem arises when probabilities ARE used. p(ws) and p(wt) are easy to estimate, but how is p(ws, wt) estimated?<br><br>Best regards<br>Markus Saers<br>PhD student, Uppsala University<br>