[Corpora-List] Dice coefficient

Wed Apr 19 09:46:26 UTC 2006

Hi Markus,

You must be working on word alignment, but I am not sure if you are using sentence aligned corpora.

>that frequency count is used instead, which is problematic
>in word alignment since that would presuppose that Ns=Nt 

If you are using sentence-aligned corpora, you can get the frequencies for ws and wt by counting the aligned sentence pairs in which each of them occurs. In this case, Ns=Nt=total_number_of_aligned_sentence_pairs. As to the co-occurrence frequency for (ws, wt), you can get it by counting the aligned sentence pairs in which both of them occur.
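
To make the counting concrete, here is a minimal Java sketch of the scheme described above. It is only an illustration under some assumptions: the class name DiceSketch and its methods are made up for this example, sentences are assumed to be already tokenised, and tokens are assumed not to contain tab characters. Because Ns=Nt=N, the N cancels out of the formula, so Dice = 2 * f(ws, wt) / ( f(ws) + f(wt) ), where each f is a count of aligned sentence pairs.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper (not from any library): Dice scores from sentence-pair counts. */
final class DiceSketch {
    // Number of aligned sentence pairs in which each source / target word occurs.
    private final Map<String, Integer> srcCount = new HashMap<>();
    private final Map<String, Integer> tgtCount = new HashMap<>();
    // Number of aligned sentence pairs in which both words occur.
    private final Map<String, Integer> pairCount = new HashMap<>();

    /** Registers one aligned sentence pair (assumed to be tokenised already). */
    void addPair(List<String> src, List<String> tgt) {
        // Sets ensure a word is counted at most once per sentence pair.
        Set<String> s = new HashSet<>(src);
        Set<String> t = new HashSet<>(tgt);
        for (String ws : s) srcCount.merge(ws, 1, Integer::sum);
        for (String wt : t) tgtCount.merge(wt, 1, Integer::sum);
        for (String ws : s) {
            for (String wt : t) {
                pairCount.merge(key(ws, wt), 1, Integer::sum);
            }
        }
    }

    /** Dice = 2 * f(ws, wt) / (f(ws) + f(wt)); the common N has cancelled. */
    double dice(String ws, String wt) {
        int fs = srcCount.getOrDefault(ws, 0);
        int ft = tgtCount.getOrDefault(wt, 0);
        if (fs + ft == 0) {
            return 0.0; // neither word was ever seen
        }
        int fst = pairCount.getOrDefault(key(ws, wt), 0);
        return 2.0 * fst / (fs + ft);
    }

    // Assumption: tokens never contain a tab, so it is a safe separator.
    private static String key(String ws, String wt) {
        return ws + "\t" + wt;
    }
}
```

You would call addPair once for every aligned sentence pair and then dice(ws, wt) for each candidate word pair. Storing every co-occurring pair grows quickly on a large corpus, so a real tool would probably need some pruning.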

If you are not using aligned corpora, you can replace the aligned sentence pairs with other corresponding text segments, such as paragraphs or sections.

Hope this helps.

Scott Piao
--------------------
Computing Department
Lancaster University
UK

-----Original Message-----
From: owner-corpora at lists.uib.no on behalf of Markus Saers
Sent: Wed 19/04/2006 08:50
To: CORPORA at uib.no
Subject: [Corpora-List] Dice coefficient

Hello,

My name is Markus Saers, and I am currently implementing an alignment tool
as part of a course in Java for NLP. When trying to implement the Dice
coefficient, I ran into some problems that I was hoping someone could help
me with.

The only definition of the Dice coefficient that I have seen looks like
this:

Dice = 2 * p(ws, wt) / ( p(ws) + p(wt) )

Where p(ws, wt) is the probability of the source word co-occurring with the
target word, p(ws) is the probability of the source word and p(wt) is the
probability of the target word.

Although it is stated as probabilities, some info that I gathered on the net
seems to suggest that frequency count is used instead, which is problematic
in word alignment since that would presuppose that Ns=Nt (where Ns is the
number of source words and Nt is the number of target words).

The second problem arises when probabilities ARE used. p(ws) and p(wt) are
easy to estimate, but how is p(ws, wt) estimated?

Best regards
Markus Saers
PhD student, Uppsala University