Corpora: Collaborative effort

Wible dwible at mail.tku.edu.tw
Tue Jun 13 09:08:21 UTC 2000


----- Original Message -----
From: Robert Luk (COMP staff) <csrluk at comp.polyu.edu.hk>
To: <krovetz at research.nj.nec.com>
Cc: <corpora at hd.uib.no>
Sent: Tuesday, June 13, 2000 9:43 AM
Subject: Re: Corpora: Collaborative effort

First, I thought this project was described not as a sense tagging project
but as something like the reverse: you give me a sense and I (one of the
collaborators) offer some sentences that illustrate that sense.  The
concerns expressed below about agreement rates among taggers seem relevant
to sense tagging proper, but I don't quite see their relevance to the
collaborative work suggested for this project.  Maybe I'm missing something.

Even so, I have a thought about inter-tagger agreement when it comes to
sense tagging. I'm new to semantic tagging so please forgive me if my
thoughts are either old news or misguided.

Let's say there are 19 senses for the verb 'run'.  It seems to me misleading
to calculate inter-tagger agreement by assuming that any instance where two
taggers each select a different sense from these 19 constitutes an absence
of agreement between them.  These 19 or however many senses are certainly
not discrete, autonomous senses, as independent from one another as pearls
on a string.  For example, I may not even be able to detect a difference
between, say, sense 7 and sense 8, or may feel that the distinction is a
matter of splitting hairs (this is how I feel about certain sense
distinctions in WordNet, for example).  In cases like these, the fact that
human tagger A chooses sense 7 and tagger B chooses sense 8 for a particular
token of 'run' is a very unimportant case of inter-tagger disagreement
compared to a case where A opts for sense 7 and B for sense 13, where sense
13 is clearly different from sense 7.
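
To make this concrete, here is a quick, purely illustrative sketch in Python
of what I mean: credit agreement when two taggers' choices fall in the same
sense cluster, not only when they pick the identical sense.  The sense
labels, the clustering, and the two taggings below are all invented for the
example.

    # Illustrative only: the sense labels, clusters, and taggings below are
    # invented.  Strict agreement counts a match only when both taggers pick
    # the identical sense; cluster-aware agreement also counts picks that
    # fall within the same (hand-built) sense cluster.

    CLUSTERS = {
        "run_7": "motion", "run_8": "motion",    # hard to tell apart
        "run_13": "manage", "run_14": "manage",
        "run_2": "flow",
    }

    def strict_agreement(tags_a, tags_b):
        return sum(a == b for a, b in zip(tags_a, tags_b)) / len(tags_a)

    def cluster_agreement(tags_a, tags_b, clusters=CLUSTERS):
        return sum(clusters[a] == clusters[b]
                   for a, b in zip(tags_a, tags_b)) / len(tags_a)

    # Two hypothetical taggers labelling five tokens of 'run'.
    tagger_a = ["run_7", "run_13", "run_2", "run_8", "run_13"]
    tagger_b = ["run_8", "run_13", "run_2", "run_7", "run_2"]

    print(strict_agreement(tagger_a, tagger_b))   # 0.4 (2 of 5 identical)
    print(cluster_agreement(tagger_a, tagger_b))  # 0.8 (4 of 5 same cluster)

This is essentially the intuition behind weighted agreement measures such as
weighted kappa, where near-misses are penalized less heavily than clear
disagreements.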

I don't know what factors were taken into consideration in calculating
agreement rates in the cases mentioned below, where the figures approach
chance.  Certainly, if agreement calculations ignore such semantic
clusterings among senses, we have to wonder about the value of such figures.
I realize that taggers also may not agree on the clusterings themselves, but
that is a different issue.  To the extent that we reject the
pearls-on-a-string view of senses, we must admit that rater agreement
calculations based on that view fall short of what they are intended to
measure.  Does anyone know of empirical research that has tried to uncover
such clusterings of senses by asking raters to judge the similarity of
senses?
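
For what it's worth, here is the sort of thing I have in mind, again purely
illustrative: collect pairwise similarity judgments over senses and merge
any pair rated above some threshold.  The ratings, the threshold, and the
sense labels below are all invented.

    # Illustrative only: the mean similarity ratings below are invented.
    # Sense pairs rated at or above THRESHOLD are merged into one group
    # (a crude single-link grouping via union-find).

    from collections import defaultdict

    THRESHOLD = 0.7

    # Hypothetical mean similarity judgments from raters, on a 0-1 scale.
    similarity = {
        ("run_7", "run_8"): 0.90,
        ("run_7", "run_13"): 0.20,
        ("run_8", "run_13"): 0.25,
        ("run_13", "run_14"): 0.80,
    }

    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    def union(x, y):
        parent[find(x)] = find(y)

    for (a, b), score in similarity.items():
        find(a); find(b)                    # make sure both senses are known
        if score >= THRESHOLD:
            union(a, b)

    groups = defaultdict(list)
    for sense in parent:
        groups[find(sense)].append(sense)

    print(list(groups.values()))
    # e.g. [['run_7', 'run_8'], ['run_13', 'run_14']]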

Best,

David Wible

>
> > Jeremy Clear wrote:
> >
> > >... That's the crucial thing -- you spend no significant
> > >time agonizing over the task; you just quickly pick some concordance
> > >lines and send them in. Sure, not everyone will agree 100% that the
> > >lines you've picked exactly match the sense I posted (first because
> > >the sense I posted was just an arbitrary definition taken from one
> > >dictionary which is clearly inadequate to define and delimit precisely
> > >a semantic range; and second, because no-one is going to validate or
> >
> > Philip Resnik wrote:
> >
> > >I agree -- especially since tolerance of noise is necessary even when
> > >working with purportedly "quality controlled" data.  And one can
> > >always post-process to clean things up if quality becomes an issue
> >
> > Krovetz wrote:
> >
> > I don't mean to put a damper on this idea, but we should expect that
> > the agreement rate will be far from 100%.  Also, the tolerance of noise
> > will depend on the amount of noise.  I did a comparison between the
> > tagging of the Brown files in Semcor and the tagging done by DSO.
> > I found that the agreement rate was 56%.  This is exactly the rate of
> > agreement we would find by chance.  So the amount of post-processing
> > could be quite a bit of work!
>
> Consider that one has 6 sense tags and the other also has 6 sense tags for
> the same word in a sentence, assuming that they use the same set of sense
> tags (although not likely).  The likelihood that the two tagging algorithms
> agreed by chance (independently) is 6 x 1/6 x 1/6.  So, the above seems to
> be true if there are 2 sense tags for the word:
>
> 2 x 1/2 x 1/2.
>
> Is this correct?
>
> For your information, we did some work measuring the agreement of sense
> tagging between HUMAN taggers, which is about 80% for both recall and
> precision (or 0.8 x 0.8 = 0.64 ~ 0.56).  However, this is for Chinese,
> over a small sample.
>
> Best,
>
> Robert Luk
>
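
As a quick check on the arithmetic in the quoted calculation: if two taggers
choose independently and uniformly among k senses, the chance of agreement
is k x (1/k) x (1/k) = 1/k, i.e. 50% for k = 2, about 17% for k = 6, and
about 5% for k = 19.  So 56% would be the chance rate only if a word had
roughly two equally likely senses; of course, chance agreement computed from
the taggers' actual (non-uniform) sense distributions could be higher.

    # Chance agreement for two taggers choosing independently and uniformly
    # among k senses: sum over the k senses of (1/k) * (1/k), i.e. 1/k.
    for k in (2, 6, 19):
        print(k, round(k * (1/k) * (1/k), 3))   # 0.5, 0.167, 0.053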
