Corpora: Question about a Brown Corpus tag

Mark Lewellen lewellen at erols.com
Thu Sep 14 13:42:25 UTC 2000


Frank Mueller pointed out that it is reasonable to leave POS-information
underspecified (i.e. group together POS categories that are difficult to
tag), since POS tagging typically takes place in the context of a larger
task, such as parsing.  The parser can then decide what is the most
appropriate category (e.g. preposition or conjunction).

An alternative to underspecification of POS information is to develop a
POS tagger that records multiple POS tags in ambiguous contexts (ideally with
a probability attached to each POS choice).  An advantage of this approach
is that POS-ambiguity information is not 'hard-coded' in advance by the
tag set, but is rather determined by sentence context, and it may be extended
to other ambiguities (such as N vs. V).
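
As a rough illustration of what such output might look like, the sketch
below keeps every tag whose probability is within a chosen ratio of the best
tag for that word.  The function name, the tag labels, the threshold, and the
probabilities are all invented for the example; they do not come from any
existing tagger.

```python
from typing import Dict, List, Tuple


def ambiguous_tags(dist: Dict[str, float], ratio: float = 0.2) -> List[Tuple[str, float]]:
    """Keep every tag whose probability is at least `ratio` times the best one.

    `dist` maps tag -> probability for a single token (hypothetical input).
    The result is sorted from most to least probable.
    """
    best = max(dist.values())
    kept = [(tag, p) for tag, p in dist.items() if p >= ratio * best]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Invented per-token distributions for the phrase "until tomorrow".
tokens = [
    ("until", {"PREP": 0.55, "CONJ": 0.44, "NOUN": 0.01}),
    ("tomorrow", {"NOUN": 0.97, "ADV": 0.03}),
]

for word, dist in tokens:
    # An ambiguous word keeps several tags; a clear-cut word keeps one.
    print(word, ambiguous_tags(dist))
```

With these made-up numbers, "until" stays ambiguous between PREP and CONJ
while "tomorrow" is reduced to a single tag, so the ambiguity class is decided
per context rather than fixed in the tag set.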

Could anyone point out projects that have developed such POS taggers, or
offer opinions as to their viability?  One difficulty I notice is that a
typical tagger using an HMM with the Viterbi algorithm determines the single
most likely _sequence_ of tags, which makes it difficult to establish
probabilities of multiple POS tags for a given word.
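
One standard way to get per-word tag probabilities out of an HMM is the
forward-backward algorithm, which sums over all tag sequences instead of
picking the single best one.  The following is only a minimal sketch under
stated assumptions: the function name, the three-tag model, and every
probability in it are invented for illustration, and the emission tables are
deliberately partial.

```python
from typing import Dict, List

Table = Dict[str, Dict[str, float]]


def tag_posteriors(words: List[str], tags: List[str], start: Dict[str, float],
                   trans: Table, emit: Table) -> List[Dict[str, float]]:
    """Return, for each word, P(tag | whole sentence) via forward-backward."""
    n = len(words)

    # fwd[i][t] = P(words[0..i], tag at position i is t)
    fwd: List[Dict[str, float]] = [{} for _ in range(n)]
    for t in tags:
        fwd[0][t] = start[t] * emit[t].get(words[0], 0.0)
    for i in range(1, n):
        for t in tags:
            incoming = sum(fwd[i - 1][s] * trans[s][t] for s in tags)
            fwd[i][t] = incoming * emit[t].get(words[i], 0.0)

    # bwd[i][t] = P(words[i+1..n-1] | tag at position i is t)
    bwd: List[Dict[str, float]] = [{} for _ in range(n)]
    for t in tags:
        bwd[n - 1][t] = 1.0
    for i in range(n - 2, -1, -1):
        for t in tags:
            bwd[i][t] = sum(trans[t][s] * emit[s].get(words[i + 1], 0.0) * bwd[i + 1][s]
                            for s in tags)

    total = sum(fwd[n - 1][t] for t in tags)  # probability of the sentence
    if total == 0.0:
        raise ValueError("sentence has zero probability under this model")
    return [{t: fwd[i][t] * bwd[i][t] / total for t in tags} for i in range(n)]


# Toy model with invented numbers (hypothetical, not estimated from any corpus).
tags = ["PREP", "CONJ", "NOUN"]
start = {"PREP": 0.4, "CONJ": 0.4, "NOUN": 0.2}
trans = {
    "PREP": {"PREP": 0.1, "CONJ": 0.1, "NOUN": 0.8},
    "CONJ": {"PREP": 0.2, "CONJ": 0.1, "NOUN": 0.7},
    "NOUN": {"PREP": 0.3, "CONJ": 0.3, "NOUN": 0.4},
}
emit = {
    "PREP": {"until": 0.1},
    "CONJ": {"until": 0.1},
    "NOUN": {"tomorrow": 0.05},
}

post = tag_posteriors(["until", "tomorrow"], tags, start, trans, emit)
for word, dist in zip(["until", "tomorrow"], post):
    print(word, {t: round(p, 3) for t, p in dist.items()})
```

With this toy model "until" comes out at roughly 0.53 PREP and 0.47 CONJ,
whereas Viterbi would report only the PREP reading.  The per-word
distributions sum to one at each position, so they can be passed on to a
parser as weighted alternatives.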

Mark Lewellen

> > on 17 Aug 2000 Eric S Atwell wrote:
> >
> > > Some tag definitions in Brown were clearly
> > > decided by what TAGGIT found computable;
> > > I *guess* linguistic inconsistencies in tagging
> > > some words may be down to drawing boundaries on
> > > grounds of computational tractability rather than
> > > purely linguistic reasons
> >
> > on 17 Aug 2000 Andrew Harley wrote:
> >
> > > This explains how so many taggers can claim 95% or higher
> > > success rates!
> >
> > > I also know taggers that tagged IN as "preposition
> > > or conjunction" on the same grounds.
> > ------------------------
>
> This is a reasonable decision, because you cannot resolve this ambiguity
> on the grounds of the immediate context (which most taggers use). It is,
> thus, better to keep the POS-information underspecified and resolve the
> ambiguity, when you are doing the parse. Otherwise, your parser has to
> work with unreliable information.
>
> > So what could be the linguistic reasons that Eric was mentioning? For me
> > (with a rather limited linguistic background) the "traditional" criteria
> > for POS determination look quite arbitrary or let's say heuristic.
> >
> > I cannot, for instance, see any advantage of separating "until" in:
> > * until tomorrow (preposition)
> > * until the morning comes (subordinating conjunction)
>
> I agree that you can (or even should) also leave this underspecified
> until you do a full parse. However, at some point you have to make a
> decision, because you have to annotate clauses and you have to annotate
> prepositional phrases. Now, the 'until' (when it is a connector) gives
> you a good cue where the clause starts.
>
> > while not separating "and" in:
> > * you and me (coordinating conjunction)
> > * I go and see (coordinating conjunction)
>
> As 'and' coordinates constituents of the same kind, you can analyse
> sentences like:
>
> 'I came and see.' as: [CL [NP [N I]] [VP [V came] [CO and] [V see]]]
> (my ad-hoc annotation ;-))
>
> The use of 'and' does not affect the 'global' structure of the clause.
> However, this is clearly different for 'until' as it introduces a
> prepositional phrase in the one case and a clause in the other.
>


