Corpora: Question about a Brown Corpus tag

Mark Lewellen lewellen at erols.com
Fri Sep 15 16:21:17 UTC 2000


In response to:
> An alternative to underspecification of POS information is to develop a
> POS tagger that records multiple POS in ambiguous contexts (ideally with
> probabilities attached to each POS choice)....
> Could anyone point out projects that have developed such POS taggers, or
> submit opinions as to their viability?

Miles Osborne wrote:

> Check out:
>
> http://www.cs.brown.edu/people/ec/papers/tagforpar.ps
>
> from the abstract:
> >
> We consider what tagging models are most appropriate as front ends for
> probabilistic context-free-grammar parsers. In particular we ask if using
> a tagger that returns more than one tag, a ``multiple tagger,'' improves
> parsing performance. Our conclusion is somewhat surprising: single tag
> Markov-model taggers are quite adequate for the task. First of all,
> parsing accuracy, as measured by the correct assignment of parts of speech
> to words, does not increase significantly when parsers select the tags
> themselves. In addition, the work required to parse a sentence goes up
> with increasing tag ambiguity, though not as much as one might expect.
> Thus, for the moment, single taggers are the best taggers.
> >

I downloaded this article, which argues that a parser should _not_ make use of
probabilities from a tagger that returns multiple tags with their probabilities.
This is counter-intuitive to me; however, here is a summary of the argument
(apologies for generalizing symbols to forms suitable for e-mail):

1) We want to maximize:  p( parse_tree | word_string ).
2) For a context-free grammar, 1) is equivalent to maximizing the product of
   the probabilities of the rules used in the parse (i.e., max product p(rules) ).
3) Since we are maximizing p( parse_tree | word_string ), the rules have words
   as their terminal symbols, so some of the rules are 'lexical rules'.
4) The probability of a lexical rule  p( tag->word ) is p( word | tag ).
5) The 'multiple' tagger results in p( tag | word ).  This is not the
   information p( word | tag ) that we require.  Using p( tag | word ) here is
   analogous to the problem of using p( tag | word ) instead of p( word | tag )
   in some early HMM taggers.  (A small illustration of this distinction
   follows just below this list.)
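
To make the distinction in 5) concrete, here is a small illustration of my own
(the toy tagged corpus and the Python below are purely hypothetical, not from
the paper): both conditionals come from the same joint counts, but they are
normalized by different marginals, so they generally give different numbers.

from collections import Counter

# Toy tagged corpus (invented for illustration only).
tagged_corpus = [
    ("the", "DT"), ("can", "NN"), ("can", "MD"), ("hold", "VB"),
    ("water", "NN"), ("the", "DT"), ("dog", "NN"), ("can", "MD"),
    ("run", "VB"),
]

word_tag = Counter(tagged_corpus)                 # joint counts c(word, tag)
words = Counter(w for w, t in tagged_corpus)      # marginal counts c(word)
tags = Counter(t for w, t in tagged_corpus)       # marginal counts c(tag)

def p_word_given_tag(word, tag):
    # What the PCFG's lexical rule  tag -> word  needs.
    return word_tag[(word, tag)] / tags[tag]

def p_tag_given_word(word, tag):
    # What a 'multiple' tagger typically returns.
    return word_tag[(word, tag)] / words[word]

print(p_word_given_tag("can", "MD"))   # 2/2 = 1.0
print(p_tag_given_word("can", "MD"))   # 2/3 = 0.67 -- a different quantity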

While I fully understand the logic of this argument, it is nevertheless
desirable to exploit the information that a 'multiple' tagger provides.
Perhaps Bayes' rule could be applied, so that we could use
( p( word ) x p( tag | word ) ) / p( tag ) instead of p( word | tag ).
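
As a rough sketch of that suggestion (the function name and the numbers below
are hypothetical, just to show the arithmetic), the conversion is a one-liner
once p( word ) and p( tag ) have been estimated from a training corpus:

def p_word_given_tag(p_tag_given_word, p_word, p_tag):
    # Bayes' rule:  p(word|tag) = p(tag|word) * p(word) / p(tag)
    return p_tag_given_word * p_word / p_tag

# e.g. the tagger says p(MD|"can") = 0.6, and unigram estimates give
# p("can") = 0.003 and p(MD) = 0.01:
print(p_word_given_tag(0.6, 0.003, 0.01))   # 0.18, an estimate of p("can"|MD)

Whether the resulting estimate is as reliable as counting p( word | tag )
directly from a treebank is of course another question.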

Are there any agreements/disagreements with the above argument, or any other
comments on the application of 'multiple' PoS taggers as front ends to
parsers?
Thanks-

Mark Lewellen


