Corpora: Collaborative effort

Jem Clear jem at cobuild.collins.co.uk
Fri Jun 9 14:54:39 UTC 2000


Dear Corpora-people

I've just had an idea for a collaborative venture that might benefit
the whole language-research community.

Suppose I were to post on the CORPORA list a word plus its definition
as taken from some good dictionary. Then I invite all and any
CORPORA list members to select from whatever corpora they may have
accessible to them one or more instances of an authentic context
in which the word is used with the given sense. I invite anyone to
email their selected examples to an email address which simply
files the examples away under the heading of the word/sense combination.
Easy, eh? How many examples might I expect to get? One hundred? Or
just ten? Or maybe 1,000? I think I might get a hundred or more.

Now of course I could continue to post different word/sense pairs
and invite anyone anywhere to contribute some examples taken from
real text. This might grow into a database of corpus data sorted into
sense categories (as delimited by the chosen dictionary).
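
As a rough illustration of the filing scheme described above, here is a
minimal sketch in Python. Everything in it is an assumption made for the
sake of the example: the class name SenseExampleStore, its methods and the
in-memory layout are hypothetical and do not describe any existing system.
It only shows examples being filed under a word/sense heading, with
verbatim repeats skipped.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class SenseExampleStore:
    """Hypothetical store: contexts filed under a (word, sense) heading."""
    examples: dict = field(default_factory=lambda: defaultdict(list))

    def file_example(self, word, sense, context, contributor="anonymous"):
        """File one authentic context under its word/sense combination."""
        key = (word.lower(), sense)
        # Collapse runs of whitespace so re-wrapped copies compare equal.
        context = " ".join(context.split())
        # Skip a context that has already been filed under this heading.
        if all(context != existing for existing, _ in self.examples[key]):
            self.examples[key].append((context, contributor))

    def count(self, word, sense):
        """Number of distinct contexts filed for this word/sense pair."""
        return len(self.examples.get((word.lower(), sense), []))


store = SenseExampleStore()
sense = "very intense or enthusiastic, or involving great activity"
store.file_example("fierce", sense,
                   "He inspires fierce loyalty in his friends.")
store.file_example("fierce", sense,
                   "A fierce battle has been raging all day in the "
                   "Croatian town of Pakrac")
# Same sentence again with different spacing: treated as a duplicate.
store.file_example("fierce", sense,
                   "He inspires  fierce loyalty in his friends.")
print(store.count("fierce", sense))
```

Keying on the word/sense pair keeps each sense category separate, so the
number of examples collected per sense can be read off directly.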

We could then share that growing resource freely, gratis, in the
public domain -- since the cost of building it would be spread so
widely that no-one would have incurred any significant costs.
Moreover the range and variety of corpora which would in some
small measure contribute to the database could be very extensive
indeed and offer thereby a comprehensiveness not achievable by
the use of any single corpus (such as the British National Corpus,
or the Bank of English). There would be no copyright problem involved
in disseminating this database freely, since no text source is being
reproduced beyond the 20 or 30 words of context necessary to
illustrate the use of the word in its particular sense.

My preliminary investigations of the Cobuild English Dictionary show
that the number of lexemes (lemmas, headwords, whatever term you like
to use) having more than one dictionary sense within the same
part-of-speech class is measured in the low thousands.
(For example, instances of "exhaust" can be divided into verb and noun
uses, and each will have a different sense.) So it may take only a few years
to compile a database giving hundreds of corpus instances of each
sense of most polysemous words of English. The potential for
exploiting such a database in information retrieval, machine
translation, etc. is clearly significant.

This scheme would be a truly collaborative effort, driven entirely
through the co-operation of the corpus linguistics community. This
might be more effective and more comprehensive than an EU-funded
project with participants from numerous EU member states, where
restrictions over proprietary software, or data, or copyright
concerns, or the multiple layers of bureaucratic administration tend
to hold back the process of compiling and disseminating useful
information.

Just for starters, here's a definition for the word "fierce" followed
by some illustrative examples:

    Fierce feelings or actions are very intense or enthusiastic, or
    involve great activity.

    Ex: A fierce battle has been raging all day in the Croatian town
        of Pakrac
    Ex: He inspires fierce loyalty in his friends.

Send any examples you can find (from any corpora to which you have
access) to "jem at cobuild.collins.co.uk".

Jem Clear

Electronic Development Director     phone:  +44 (0)121-414-3926
Collins Dictionaries                  fax:  +44 (0)121-414-6203
Westmere, 50 Edgbaston Park Road    email: jem at cobuild.collins.co.uk
Birmingham, B15 2RX, UK               WWW: www.cobuild.collins.co.uk


