Corpora: Collaborative effort

Wed Jun 14 14:09:29 UTC 2000

> In the case of Semcor and DSO, the sense inventory was the same (WordNet).
> The rate of agreement I mentioned was the agreement we would get by
> tagging all instances with the most frequent sense for the word in the
> corpus.

As reported in our ACL SIGLEX99 workshop paper ("A Case Study on
Inter-Annotator Agreement for Word Sense Disambiguation", by Hwee Tou Ng,
Chung Yong Lim, and Shou King Foo), for the 30,315 sentences that are common
to both Semcor and the DSO corpus, the rate of inter-annotator agreement is
56.7%. Our calculation indicates that the most frequent senses (of the 191
words) in the intersection corpus of 30,315 Semcor sentences account for
53.2% of the word occurrences.
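
To make the two figures concrete, here is a minimal sketch of how an
agreement rate and a most-frequent-sense share can be computed. It is only an
illustration on made-up toy data, not the code used for the paper; the
function names, the sense labels, and the assumption that the two corpora are
aligned instance by instance are all hypothetical.

```python
from collections import Counter, defaultdict


def agreement_rate(tags_a, tags_b):
    """Fraction of aligned instances that received the same sense tag."""
    if len(tags_a) != len(tags_b) or not tags_a:
        raise ValueError("need two non-empty, aligned tag lists")
    matches = sum(a == b for a, b in zip(tags_a, tags_b))
    return matches / len(tags_a)


def most_frequent_sense_share(words, tags):
    """Fraction of instances tagged with their word's most frequent sense.

    The most frequent sense is determined per word from `tags` itself.
    """
    if len(words) != len(tags) or not tags:
        raise ValueError("need two non-empty, aligned lists")
    counts = defaultdict(Counter)
    for word, tag in zip(words, tags):
        counts[word][tag] += 1
    covered = sum(c.most_common(1)[0][1] for c in counts.values())
    return covered / len(tags)


# Toy data (hypothetical): six instances of two words, tagged in two corpora.
words = ["bank", "bank", "bank", "line", "line", "line"]
corpus_a = ["bank%1", "bank%1", "bank%2", "line%1", "line%2", "line%3"]
corpus_b = ["bank%1", "bank%2", "bank%2", "line%1", "line%2", "line%1"]

print(agreement_rate(corpus_a, corpus_b))          # 4 of 6 instances agree
print(most_frequent_sense_share(words, corpus_a))  # 3 of 6 instances covered
```

On this toy data the agreement rate is 4/6 and the most-frequent-sense share
is 3/6; the comparison in the text is between the corresponding two numbers
for the real intersection corpus.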

However, part of the reason the two figures are so close is that many of
these 191 words have a very skewed sense distribution, such that the most
frequent sense of a word accounts for a large share of that word's
occurrences. If we restrict our attention to about half of these 191 words
(61 nouns and 35 verbs) where the most frequent sense occurs comparatively
less often, then the Semcor-DSO agreement rate for these 61 nouns is 10%
higher than the most frequent sense occurrence rate, and for the 35 verbs it
is 16% higher.

Another point to note is that the inter-annotator agreement rate has a lot
to do with the very fine-grained sense distinctions used in WordNet. As reported
in our SIGLEX99 paper, if we allow coarser sense classes, then the
inter-annotator agreement for a subset of 53 nouns and 42 verbs can be
higher than 93%.
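
The effect of coarser sense classes on agreement can be sketched as follows.
This is only a toy illustration: the mapping from fine senses to coarse
classes, the sense labels, and the function name are hypothetical, and the
sketch says nothing about how the coarser classes in the paper were obtained.

```python
def coarse_agreement_rate(tags_a, tags_b, coarse_class):
    """Agreement rate after mapping each fine sense to a coarse class.

    Senses missing from `coarse_class` are left as they are.
    """
    if len(tags_a) != len(tags_b) or not tags_a:
        raise ValueError("need two non-empty, aligned tag lists")
    matches = sum(
        coarse_class.get(a, a) == coarse_class.get(b, b)
        for a, b in zip(tags_a, tags_b)
    )
    return matches / len(tags_a)


# Toy data (hypothetical): the same six instances tagged in two corpora.
corpus_a = ["bank%1", "bank%1", "bank%2", "line%1", "line%2", "line%3"]
corpus_b = ["bank%1", "bank%2", "bank%2", "line%1", "line%2", "line%1"]

# Hypothetical coarse classes: line%1 and line%3 are merged into one class.
coarse_class = {
    "bank%1": "bank/A",
    "bank%2": "bank/B",
    "line%1": "line/A",
    "line%3": "line/A",
    "line%2": "line/B",
}

fine = coarse_agreement_rate(corpus_a, corpus_b, {})              # no merging
coarse = coarse_agreement_rate(corpus_a, corpus_b, coarse_class)  # with merging
print(fine, coarse)
```

With the empty mapping the function reduces to plain fine-grained agreement
(4 of 6 here); merging line%1 and line%3 turns one disagreement into an
agreement (5 of 6).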

Hwee Tou
DSO National Laboratories, Singapore