Corpora: Collaborative effort

Tue Jun 13 01:01:03 UTC 2000

Jeremy Clear wrote:

>... That's the crucial thing -- you spend no significant
>time agonizing over the task; you just quickly pick some concordance
>lines and send them in. Sure, not everyone will agree 100% that the
>lines you've picked exactly match the sense I posted (first because
>the sense I posted was just an arbitrary definition taken from one
>dictionary which is clearly inadequate to define and delimit precisely
>a semantic range; and second, because no-one is going to validate or

Philip Resnik wrote:

>I agree -- especially since tolerance of noise is necessary even when
>working with purportedly "quality controlled" data.  And one can
>always post-process to clean things up if quality becomes an issue

I don't mean to put a damper on this idea, but we should expect that
the agreement rate will be far from 100%.  Also, the tolerance of noise
will depend on the amount of noise.  I did a comparison between the
tagging of the Brown files in Semcor and the tagging done by DSO.
I found that the agreement rate was 56%.  This is exactly the rate of
agreement we would find by chance.  So the amount of post-processing
could be quite a bit of work!
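
To make the comparison between observed agreement and chance agreement
concrete, here is a minimal sketch in Python.  It is not the procedure
used for the Semcor/DSO comparison above.  It assumes Cohen's kappa as
the chance-corrected measure, and the two short taggings (tagging_a,
tagging_b) and the sense labels "s1" and "s2" are made up purely for
illustration.

```python
from collections import Counter


def observed_agreement(tags_a, tags_b):
    """Fraction of items on which the two taggings assign the same sense."""
    if len(tags_a) != len(tags_b) or not tags_a:
        raise ValueError("taggings must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(tags_a, tags_b))
    return matches / len(tags_a)


def chance_agreement(tags_a, tags_b):
    """Agreement expected if each tagger assigned senses independently,
    following only their own overall sense frequencies."""
    n = len(tags_a)
    freq_a = Counter(tags_a)
    freq_b = Counter(tags_b)
    # A Counter returns 0 for a sense the other tagger never used.
    return sum((freq_a[s] / n) * (freq_b[s] / n) for s in freq_a)


def cohen_kappa(tags_a, tags_b):
    """Chance-corrected agreement: 0 means no better than chance, 1 means perfect."""
    p_o = observed_agreement(tags_a, tags_b)
    p_e = chance_agreement(tags_a, tags_b)
    if p_e == 1.0:
        # Both taggers used one identical label throughout; kappa is undefined.
        return float("nan")
    return (p_o - p_e) / (1.0 - p_e)


# Hypothetical toy data, NOT taken from Semcor or DSO.
tagging_a = ["s1", "s1", "s1", "s2", "s2", "s1", "s1", "s2", "s1", "s1"]
tagging_b = ["s1", "s1", "s2", "s2", "s1", "s1", "s1", "s1", "s1", "s2"]

print("observed:", observed_agreement(tagging_a, tagging_b))
print("chance:  ", chance_agreement(tagging_a, tagging_b))
print("kappa:   ", cohen_kappa(tagging_a, tagging_b))
```

When one sense dominates, as in this toy data, the chance rate is already
high, so a raw agreement figure that looks respectable can correspond to
a kappa close to zero.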

Bob

krovetz at research.nj.nec.com