Corpora: Question about a Brown Corpus tag

Atro Voutilainen atro.voutilainen at conexor.fi
Thu Sep 14 14:57:12 UTC 2000


Eric,

Thanks for mentioning ENGCG. A recent version, called EngCG-2, can be tested at
http://www.conexor.fi; an evaluation paper is also available online there.

Atro Voutilainen


E S Atwell wrote:
>
> Mark Lewellen asked:
>
> >An alternative to underspecification of POS information is to develop a
> >POS tagger that records multiple POS in ambiguous contexts (ideally with
> >probabilities attached to each POS choice).  An advantage to this
> >approach is that POS-ambiguity information is not 'hard-coded' in advance
> >by the tag set, but is rather determined by sentence context, and may be
> >extended to other ambiguities (such as N vs. V).
> >
> >Could anyone point out projects that have developed such POS taggers, or
> >submit opinions as to their viability?  One difficulty I notice is that a
> >typical tagger using an HMM with the Viterbi algorithm determines a most
> >likely _sequence_, which would make it difficult to establish
> >probabilities of multiple POS tags for a given word.
>
> The CLAWS tagger originally developed to PoS-tag the LOB Corpus did this
> (as presumably did later versions of CLAWS used on SEC, BNC etc) - the
> tagger included the option of outputting all tags allocated by
> lexicon+suffixlist, along with context-dependent weights.  The proofreader
> had to mark all cases where the correct tag wasn't the highest-weighted;
> then a "cleanup" program "rubbed out" all but the first tag (unless
> proofreading marked another tag, in which case this was left as the single
> correct tag).  Using a Markov sequence-based model is not a problem - the
> relative weight attached to a tag can be the weight from the best sequence
> using it, or the sum of all sequences passing through the tag, or some
> other function of all sequences including the tag.
>
> The ENGCG English Constraint Grammar tagger/parser would probably appeal
> to you even more.  This applies all tags from lexicon, then applies
> constraint-rules to rule out candidates incompatible with context. Usually
> this leaves only one candidate PoS-tag per word, but where there is an
> ambiguous context it leaves more than one tag.
>
> For refs to these and more tagsets, see:
> Atwell E, Demetriou G, Hughes J, Schiffrin A, Souter C, and Wilcock S. 2000.
> A comparative evaluation of modern English corpus grammatical annotation schemes.
> ICAME Journal, volume 24, pages 7-23, International Computer Archive of
> Modern and Medieval English, HIT Centre, Bergen University. ISSN:0801-5775
>
> --
> Eric Atwell, Distributed Multimedia Systems MSc Tutor & SOCRATES Tutor
> School of Computing, University of Leeds, LEEDS LS2 9JT
> TEL: (44)113-2335430  FAX: (44)113-2335468
> WWW: http://www.comp.leeds.ac.uk/eric  EMAIL: eric at comp.leeds.ac.uk
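
To illustrate the point above about sequence-based models, here is a minimal
Python sketch of two of the per-tag weights Eric describes: the weight of the
best sequence passing through a tag, and the sum over all sequences passing
through it. It is not CLAWS or ENGCG code. The function name tag_weights, the
two-tag tagset and every probability in the toy model are invented purely for
illustration.

```python
def tag_weights(words, tags, start, trans, emit):
    """Return (sum_w, best_w): for each word, a dict of normalised tag weights.

    sum_w  uses the sum over all tag sequences passing through the tag.
    best_w uses the single best tag sequence passing through the tag.
    """
    n = len(words)
    fwd_sum = [{} for _ in range(n)]
    fwd_max = [{} for _ in range(n)]
    bwd_sum = [{} for _ in range(n)]
    bwd_max = [{} for _ in range(n)]

    # Forward pass: scores of sequences for words[0..i] that end in tag t.
    for t in tags:
        fwd_sum[0][t] = fwd_max[0][t] = start[t] * emit[t].get(words[0], 0.0)
    for i in range(1, n):
        for t in tags:
            e = emit[t].get(words[i], 0.0)
            fwd_sum[i][t] = e * sum(fwd_sum[i - 1][s] * trans[s][t] for s in tags)
            fwd_max[i][t] = e * max(fwd_max[i - 1][s] * trans[s][t] for s in tags)

    # Backward pass: scores of continuations for words[i+1..] given tag t at i.
    for t in tags:
        bwd_sum[n - 1][t] = bwd_max[n - 1][t] = 1.0
    for i in range(n - 2, -1, -1):
        for t in tags:
            steps = [trans[t][s] * emit[s].get(words[i + 1], 0.0) for s in tags]
            bwd_sum[i][t] = sum(p * bwd_sum[i + 1][s] for p, s in zip(steps, tags))
            bwd_max[i][t] = max(p * bwd_max[i + 1][s] for p, s in zip(steps, tags))

    def normalise(scores):
        total = sum(scores.values())
        return {t: (v / total if total else 0.0) for t, v in scores.items()}

    sum_w = [normalise({t: fwd_sum[i][t] * bwd_sum[i][t] for t in tags})
             for i in range(n)]
    best_w = [normalise({t: fwd_max[i][t] * bwd_max[i][t] for t in tags})
              for i in range(n)]
    return sum_w, best_w


# Toy model: all numbers below are made up for this example.
TAGS = ["N", "V"]
START = {"N": 0.7, "V": 0.3}
TRANS = {"N": {"N": 0.4, "V": 0.6}, "V": {"N": 0.7, "V": 0.3}}
EMIT = {"N": {"time": 0.6, "flies": 0.4}, "V": {"time": 0.2, "flies": 0.8}}

if __name__ == "__main__":
    words = ["time", "flies"]
    sum_w, best_w = tag_weights(words, TAGS, START, TRANS, EMIT)
    for word, s, b in zip(words, sum_w, best_w):
        print(word, "sum:", s, "best:", b)
```

Either weighting leaves every candidate tag attached to the word with a
context-dependent score, so a most-likely-sequence tagger can still report
several tags per word. The two weightings can rank tags differently on longer
inputs, which is one reason to state which one a tagger reports.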

--
Atro Voutilainen                              mobile: +358 50 5437452
Conexor oy                                       fax: +358 9 37468502
Helsinki Science Park                     atro.voutilainen at conexor.fi
Koetilantie 3, 00710 Helsinki, Finland          http://www.conexor.fi


