Corpora: Collaborative venture

Jem Clear jem at cobuild.collins.co.uk
Tue Jun 13 11:23:20 UTC 2000


Re: the points raised by Eric Atwell (et al.) (see snippet below).


> >I agreed if the sense tags have completely different meanings. However,
> >the differences in meaning between tags may be in shades of meaning
> >rather than the crisp decision that they are or are not the same....

> ... I don't believe there is a clear, "self-evident" set of semantic
> tags. Semantic tagging could instead aim to annotate each word with
> a SET of semantic features, and "disambiguation" could aim to
> eliminate semantic features incompatible with context; this would
> allow for overlap and indeterminate sense-tagging. The set of
> semantic features for a word could be a bundle of semantic
> information, for example the lemma/root, subject-category code,
> selection restrictions, and meaning definition from LDOCE; instead
> of sense-tagging, if the aim was to eliminate features which were
> incompatible with context, you should get more inter-annotator
> agreement.


Oh dear! No, no, no. OK. Maybe I was being a little naive in
thinking that a large group of corpus linguists could even begin
to agree on a simple, but potentially useful, collaborative
scheme. A project in "semantic tagging" seems to my way of
thinking precisely what we do *not* need -- or rather we have
plenty of such projects going on at the moment anyway so there's
no widespread benefit to the linguistic community in having
a few more people sitting round discussing what exactly *is*
the set of primitive semantic components or how a semantic "entry"
should be structured or whatever.

I was feeling reckless last Friday afternoon so thought I'd float
an extremely simple idea based on the assumption that speakers
of English (native or non-native) have some ability to pick from
a number of offered citations those which in their opinion match
a given dictionary definition. I am not so foolish as to believe

a) that all respondents would select the same citations if offered the
same source set (this is the Consensus Issue)

b) that the dictionary definition is "true" or "correct" or clearly defines
the boundaries of a word sense (this is the Which Tagset? Issue)

c) that all citations selected by respondents would be "correct" (this
is the Quality Control Issue: aka the Noise Problem)
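
To make the pooling idea concrete, here is a minimal sketch of how
selections from several respondents might be combined. It is an
illustration only, not part of any existing scheme: the function name
aggregate_selections, the respondent and citation identifiers, and the
min_agreement threshold are all hypothetical. It keeps a citation only
when a sufficient share of respondents picked it, which speaks to
issues (a) and (c) above and does nothing about (b).

```python
from collections import Counter


def aggregate_selections(selections, min_agreement=0.5):
    """Pool respondents' choices for one dictionary definition.

    selections: dict mapping a respondent id to the set of citation ids
        that respondent judged to match the definition (hypothetical format).
    min_agreement: assumed threshold; the share of respondents that must
        have picked a citation for it to be kept.
    Returns a dict mapping each kept citation id to its share of votes.
    """
    if not selections:
        return {}
    counts = Counter()
    for chosen in selections.values():
        # set() guards against a respondent listing a citation twice
        counts.update(set(chosen))
    total = len(selections)
    return {
        citation: votes / total
        for citation, votes in counts.items()
        if votes / total >= min_agreement
    }


# Three hypothetical respondents judging citations numbered 1 to 5.
selections = {
    "r1": {1, 2, 3},
    "r2": {2, 3},
    "r3": {3, 5},
}
kept = aggregate_selections(selections)
print(sorted(kept))  # citations 2 and 3 reach the 0.5 threshold
```

Citations picked by only one respondent (1 and 5 here) are dropped as
probable noise; lowering min_agreement would trade precision for
coverage.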

Suppose in primitive times, when the only routes connecting towns and
villages were rough, muddy tracks, that someone proposes that the
community build a road by bringing bucketloads of rubble, stones, ash,
whatever and packing it down to make a hard flat surface. As soon as this
idea is proposed, one group of villagers get very excited because
no-one has told them how wide the proposed road should be (just wide
enough for one cart -- or wide enough for two carts to pass?). A wise
man from another town questions whether straw should be added to the
stones being thrown down -- straw may disintegrate and not last
through winter rains. Others get into fierce arguments about whether
the road should go straight from one village to another or should wind
around avoiding hills, deep valleys, marshland, etc.

You get the idea! Just a few people bring along a few bucketloads of
stones and rubble and the road extends for no more than 5 metres,
despite the fact that almost everyone agrees that a road of some sort
would be much better than the rutted, filthy, muddy track along which
they have to walk, ride, or drive their livestock.

Linguistics is such fun, isn't it?

Jem Clear

Electronic Development Director     phone:  +44 (0)121-414-3926
Collins Dictionaries                  fax:  +44 (0)121-414-6203
Westmere, 50 Edgbaston Park Road    email: jem at cobuild.collins.co.uk
Birmingham, B15 2RX, UK               WWW: www.cobuild.collins.co.uk


