Corpora: Question about a Brown Corpus tag

Tylman Ule ule at sfs.nphil.uni-tuebingen.de
Thu Sep 14 14:22:06 UTC 2000


Mark Lewellen wrote:
>
> An alternative to underspecification of POS information is to develop a
> POS tagger that records multiple POS in ambiguous contexts (ideally with
> probabilities attached to each POS choice).  An advantage to this approach
> is that POS-ambiguity information is not 'hard-coded' in advance by the
> tag set, but is rather determined by sentence context, and may be extended
> to other ambiguities (such as N vs. V).
>
> Could anyone point out projects that have developed such POS taggers, or
> submit opinions as to their viability?  One difficulty I notice is that a
> typical tagger using an HMM with the Viterbi algorithm determines a most
> likely _sequence_, which would make it difficult to establish probabilities
> of multiple POS tags for a given word.

You may either use a tagger that also outputs the alternative tags it
deems less probable, or combine a number of taggers and reach a similar
result via a voting scheme, or, of course, do both.
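
On the difficulty with Viterbi raised in the quoted message: an HMM can
also yield per-word tag probabilities if one runs the forward-backward
algorithm instead of (or alongside) Viterbi. The following is only a
minimal sketch, assuming a bigram HMM stored in plain dictionaries; the
function name tag_posteriors and the toy numbers are made up for
illustration and are not taken from any particular tagger.

```python
def tag_posteriors(words, tags, start, trans, emit):
    """Return one dict per word mapping tag -> P(tag at this position | sentence).

    start[t]    : P(first tag = t)
    trans[s][t] : P(next tag = t | current tag = s)
    emit[t][w]  : P(word = w | tag = t); unseen words get probability 0
    """
    n = len(words)
    if n == 0:
        return []

    # Forward pass: alpha[i][t] = P(words[0..i], tag_i = t)
    alpha = [{t: start[t] * emit[t].get(words[0], 0.0) for t in tags}]
    for i in range(1, n):
        alpha.append({
            t: emit[t].get(words[i], 0.0)
               * sum(alpha[i - 1][s] * trans[s][t] for s in tags)
            for t in tags
        })

    # Backward pass: beta[i][t] = P(words[i+1..n-1] | tag_i = t)
    beta = [None] * n
    beta[n - 1] = {t: 1.0 for t in tags}
    for i in range(n - 2, -1, -1):
        beta[i] = {
            t: sum(trans[t][s] * emit[s].get(words[i + 1], 0.0) * beta[i + 1][s]
                   for s in tags)
            for t in tags
        }

    # Probability of the whole sentence under the model
    total = sum(alpha[n - 1][t] for t in tags)
    if total == 0.0:
        raise ValueError("sentence has zero probability under the model")

    # Posterior marginal for each position and tag
    return [{t: alpha[i][t] * beta[i][t] / total for t in tags}
            for i in range(n)]


# Toy model with invented probabilities (N vs. V ambiguity only).
TAGS = ["N", "V"]
START = {"N": 0.6, "V": 0.4}
TRANS = {"N": {"N": 0.3, "V": 0.7}, "V": {"N": 0.8, "V": 0.2}}
EMIT = {"N": {"plans": 0.5, "change": 0.5},
        "V": {"plans": 0.4, "change": 0.6}}

for word, dist in zip(["plans", "change"],
                      tag_posteriors(["plans", "change"], TAGS, START, TRANS, EMIT)):
    print(word, {t: round(p, 3) for t, p in dist.items()})
```

These marginals sum to 1 for every word, so they can be stored directly
as weighted alternatives. Note that the tag with the highest marginal at
each position need not coincide with the tag on the Viterbi path.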

For information on system combination via voting, see e.g.

@InProceedings{,
  author =	 {Hans van Halteren and Jakub Zavrel and Walter
                  Daelemans},
  title =	 {Improving Data Driven Wordclass Tagging by System
                  Combination},
  year =	 1998,
  month =	 aug,
  booktitle =	 {Proceedings of COLING-ACL '98},
  address =	 {Montreal, Canada},
  publisher =	 {ACL},
  url =		 {ftp://ilk.kub.nl/pub/papers/coling98.ps.gz},
}
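
To make the voting idea concrete, here is a minimal sketch of simple
plurality voting over the outputs of several taggers. It is only an
illustration under my own assumptions (each tagger returns exactly one
tag per token, and all taggers see the same tokenisation); it is not the
set of combination methods evaluated in the paper, and the function name
vote_tags is made up. The vote shares are relative frequencies among
the taggers, not calibrated probabilities.

```python
from collections import Counter


def vote_tags(tagger_outputs):
    """Combine tag sequences from several taggers by plurality vote.

    tagger_outputs: list of tag sequences, one per tagger, all of equal length.
    Returns, for each token, a list of (tag, share) pairs sorted by
    decreasing share; ties are broken alphabetically by tag.
    """
    if not tagger_outputs:
        return []
    length = len(tagger_outputs[0])
    if any(len(seq) != length for seq in tagger_outputs):
        raise ValueError("all taggers must tag the same number of tokens")

    combined = []
    for token_tags in zip(*tagger_outputs):
        counts = Counter(token_tags)
        ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
        combined.append([(tag, n / len(token_tags)) for tag, n in ranked])
    return combined


# Three hypothetical taggers disagreeing on "until" (IN vs. CS) and on a N/V word.
outputs = [
    ["IN", "NN"],
    ["IN", "VB"],
    ["CS", "NN"],
]
for alternatives in vote_tags(outputs):
    print(alternatives)
```

The first pair in each list is the winning tag; the remaining pairs are
the alternatives, which is what makes this "a similar solution" to a
single tagger that reports its less probable tags.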

I know of at least one tagger that provides alternative tags within a
given search beam, namely Thorsten Brants's TnT tagger
(http://www.coli.uni-sb.de/~thorsten/tnt).

As for the third option (doing both), I am currently investigating that
approach, and the results so far look quite promising.


Best,
Tylman


> Mark Lewellen
>
> > > on 17 Aug 2000 Eric S Atwell wrote:
> > >
> > > > Some tag definitions in Brown were clearly
> > > > decided by what TAGGIT found computable;
> > > > I *guess* linguistic inconsistencies in tagging
> > > > some words may be down to drawing boundaries on
> > > > grounds of computational tractability rather than
> > > > purely linguistic reasons
> > >
> > > on 17 Aug 2000 Andrew Harley wrote:
> > >
> > > > This explains how so many taggers can claim 95% or higher
> > > > success rates!
> > >
> > > > I also know taggers that tagged IN as "preposition
> > > > or conjunction" on the same grounds.
> > > ------------------------
> >
> > This is a reasonable decision, because you cannot resolve this ambiguity
> > on the grounds of the immediate context (which most taggers use). It is,
> > thus, better to keep the POS-information underspecified and resolve the
> > ambiguity, when you are doing the parse. Otherwise, your parser has to
> > work with unreliable information.
> >
> > > So what could be the linguistic reasons that Eric was mentioning? For me
> > > (with a rather limited linguistic background) the "traditional" criteria
> > > for POS determination look quite arbitrary or let's say heuristic.
> > >
> > > I cannot, for instance, see any advantage of separating "until" in:
> > > * until tomorrow (preposition)
> > > * until the morning comes (subordinating conjunction)
> >
> > I agree that you can (or even should) also leave this underspecified
> > until you do a full parse. However, at some point you have to make a
> > decision, because you have to annotate clauses and you have to annotate
> > prepositional phrases. Now, the 'until' (when it is a connector) gives
> > you a good cue where the clause starts.
> >
> > > while not separating "and" in:
> > > * you and me (coordinating conjunction)
> > > * I go and see (coordinating conjunction)
> >
> > As 'and' coordinates constituents of the same kind, you can analyse
> > sentences like:
> >
> > 'I came and see.' as: [CL [NP [N I]] [VP [V came] [CO and] [V see]]]
> > (my ad-hoc annotation ;-))
> >
> > The use of 'and' does not affect the 'global' structure of the clause.
> > However, this is clearly different for 'until' as it introduces a
> > prepositional phrase in the one case and a clause in the other.
> >

--
Tylman Ule,  Tel. 07071/29-78490, Fax 07071/551335
	Seminar für Sprachwissenschaft, Universität Tübingen
        Kleine Wilhelmstraße 113, 72074 Tübingen


