Corpora: Question about a Brown Corpus tag

Thu Sep 14 14:32:21 UTC 2000

Mark Lewellen asked:

>An alternative to underspecification of POS information is to develop a
>POS tagger that records multiple POS in ambiguous contexts (ideally with
>probabilities attached to each POS choice).  An advantage to this
>approach is that POS-ambiguity information is not 'hard-coded' in advance
>by the tag set, but is rather determined by sentence context, and may be
>extended to other ambiguities (such as N vs. V).
>
>Could anyone point out projects that have developed such POS taggers, or
>submit opinions as to their viability?  One difficulty I notice is that a
>typical tagger using an HMM with the Viterbi algorithm determines a most
>likely _sequence_, which would make it difficult to establish
>probabilities of multiple POS tags for a given word.

The CLAWS tagger, originally developed to PoS-tag the LOB Corpus, did this
(as presumably did the later versions of CLAWS used on SEC, BNC etc.): the
tagger included the option of outputting all tags allocated by the
lexicon and suffix list, along with context-dependent weights.  The
proofreader had to mark all cases where the correct tag wasn't the
highest-weighted; then a "cleanup" program "rubbed out" all but the first
tag (unless proofreading had marked another tag, in which case that one was
left as the single correct tag).  Using a Markov sequence-based model is not
a problem: the relative weight attached to a tag can be the weight of the
best sequence using it, or the sum of the weights of all sequences passing
through the tag, or some other function of all sequences including the tag.
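
To make the two weighting options concrete, here is a minimal sketch in
Python.  It is not the CLAWS implementation; it assumes a plain first-order
HMM, and the function name tag_weights and the toy probabilities are invented
purely for illustration.  For each word it computes (a) the sum over all tag
sequences passing through each tag, normalised to a probability, and (b) the
weight of the single best sequence using that tag.

```python
def tag_weights(words, tags, start, trans, emit):
    """Per-word tag weights from a first-order HMM (illustrative sketch).

    Returns (summed, best):
      summed[i][t] = normalised sum over all tag sequences with tag t at word i
      best[i][t]   = weight of the best single tag sequence with tag t at word i
    """
    n = len(words)
    e = lambda t, i: emit[t].get(words[i], 0.0)  # emission prob, 0 if unseen

    # Forward pass: alpha sums over prefixes, delta maximises over prefixes.
    alpha = [{} for _ in range(n)]
    delta = [{} for _ in range(n)]
    for t in tags:
        alpha[0][t] = delta[0][t] = start[t] * e(t, 0)
    for i in range(1, n):
        for t in tags:
            alpha[i][t] = e(t, i) * sum(alpha[i - 1][s] * trans[s][t] for s in tags)
            delta[i][t] = e(t, i) * max(delta[i - 1][s] * trans[s][t] for s in tags)

    # Backward pass: beta sums over suffixes, eps maximises over suffixes.
    beta = [{} for _ in range(n)]
    eps = [{} for _ in range(n)]
    for t in tags:
        beta[n - 1][t] = eps[n - 1][t] = 1.0
    for i in range(n - 2, -1, -1):
        for t in tags:
            beta[i][t] = sum(trans[t][s] * e(s, i + 1) * beta[i + 1][s] for s in tags)
            eps[i][t] = max(trans[t][s] * e(s, i + 1) * eps[i + 1][s] for s in tags)

    total = sum(alpha[n - 1][t] for t in tags)  # weight of all sequences
    summed = [{t: alpha[i][t] * beta[i][t] / total for t in tags} for i in range(n)]
    best = [{t: delta[i][t] * eps[i][t] for t in tags} for i in range(n)]
    return summed, best


# Toy model with made-up numbers (assumption, not taken from any corpus).
tags = ["N", "V"]
start = {"N": 0.7, "V": 0.3}
trans = {"N": {"N": 0.4, "V": 0.6}, "V": {"N": 0.7, "V": 0.3}}
emit = {"N": {"they": 0.2, "can": 0.3, "fish": 0.5},
        "V": {"can": 0.6, "fish": 0.4}}
words = ["they", "can", "fish"]

summed, best = tag_weights(words, tags, start, trans, emit)
for word, s, b in zip(words, summed, best):
    print(word, s, b)
```

The summed weights for a word add up to 1, so they can be read directly as
per-tag probabilities; the best-sequence weights do not, but their maximum at
every word is the weight of the ordinary Viterbi path.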

The ENGCG English Constraint Grammar tagger/parser would probably appeal
to you even more.  It first assigns every tag the lexicon allows, then
applies constraint rules to rule out candidates incompatible with the
context.  Usually this leaves only one candidate PoS-tag per word, but where
the context is ambiguous it leaves more than one tag.
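
The general idea (all tags first, then elimination by context) can be
sketched as follows.  This is only an illustration: the three-word lexicon,
the tag names and the single rule are hypothetical and are not taken from
ENGCG, whose real rule set is far larger and more refined.

```python
# Hypothetical toy lexicon: every tag each word could carry out of context.
LEXICON = {
    "the": {"DET"},
    "can": {"N", "V", "AUX"},
    "rusts": {"N", "V"},
}


def no_verb_after_determiner(cands, i):
    """Toy constraint: after an unambiguous DET, discard verbal readings."""
    if i > 0 and cands[i - 1] == {"DET"}:
        return {"V", "AUX"}
    return set()


def disambiguate(words, lexicon, rules):
    """Assign all lexicon tags, then let rules remove readings by context.

    A rule returns the set of tags to discard at position i.  The last
    remaining candidate for a word is never removed, so every word keeps
    at least one tag; words no rule can resolve stay ambiguous.
    """
    cands = [set(lexicon[w]) for w in words]
    changed = True
    while changed:  # repeat until no rule removes anything more
        changed = False
        for i in range(len(words)):
            for rule in rules:
                remaining = cands[i] - rule(cands, i)
                if remaining and remaining != cands[i]:
                    cands[i] = remaining
                    changed = True
    return cands


words = ["the", "can", "rusts"]
result = disambiguate(words, LEXICON, [no_verb_after_determiner])
for word, tagset in zip(words, result):
    print(word, sorted(tagset))
```

With this one rule, "can" is reduced to a single reading, while "rusts" keeps
both of its tags because nothing in the toy rule set decides between them.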

For refs to these and more tagsets, see:
Atwell E, Demetriou G, Hughes J, Schiffrin A, Souter C, and Wilcock S. 2000.
A comparative evaluation of modern English corpus grammatical annotation
schemes.  ICAME Journal, volume 24, pages 7-23.  International Computer
Archive of Modern and Medieval English, HIT Centre, Bergen University.
ISSN: 0801-5775

--
Eric Atwell, Distributed Multimedia Systems MSc Tutor & SOCRATES Tutor
School of Computing, University of Leeds, LEEDS LS2 9JT
TEL: (44)113-2335430  FAX: (44)113-2335468
WWW: http://www.comp.leeds.ac.uk/eric  EMAIL: eric at comp.leeds.ac.uk