Corpora: in-line PoS tagger

Satoshi Sekine sekine at cs.nyu.edu
Thu Sep 6 14:05:42 UTC 2001


Re: POS tagger

Thank you very much for introducing my parser (Apple Pie Parser).
However, I think the accuracy of the output is not as good as Brill's.

By the way, I have improved Brill's tagger by hand. I read through his
rules myself and modified them. For example, I introduced classes for
be-verbs, have-verbs, numbers (and more), and cleaned up the rules for
"there", auxiliary verbs, other verbs (and more and more). The accuracy
improved from 96.5% to 97% on test data (and we have evidence that this
is almost the upper limit because of errors in the Penn Treebank; i.e.,
if you find a higher accuracy, the tagger may be overtrained). It works
with stdin/stdout and files.
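
To illustrate what a word class adds to a Brill-style transformation
rule, here is a minimal sketch in Python. It is not taken from the
tagger itself: the class names (BE_VERB, HAVE_VERB, NUMBER), the
helper functions, the example rule and the in-line word/TAG output
format are all assumptions made for illustration.

```python
# Hypothetical word classes; the tagger's real classes are not listed in this message.
WORD_CLASSES = {
    "BE_VERB": {"be", "am", "is", "are", "was", "were", "been", "being"},
    "HAVE_VERB": {"have", "has", "had", "having"},
}


def word_class(word):
    """Return the (hypothetical) class name of a word, or None."""
    w = word.lower()
    for name, members in WORD_CLASSES.items():
        if w in members:
            return name
    # Treat strings such as "42", "1,000" or "3.5" as numbers.
    if w.replace(",", "").replace(".", "", 1).isdigit():
        return "NUMBER"
    return None


def apply_rule(tagged, from_tag, to_tag, prev_class):
    """Change from_tag to to_tag when the previous word is in prev_class."""
    out = list(tagged)
    for i in range(1, len(out)):
        word, tag = out[i]
        if tag == from_tag and word_class(out[i - 1][0]) == prev_class:
            out[i] = (word, to_tag)
    return out


def to_inline(tagged):
    """Render (word, tag) pairs in an assumed in-line word/TAG format."""
    return " ".join(f"{word}/{tag}" for word, tag in tagged)


# Initial tagging with a typical error: "closed" tagged as past tense.
sentence = [("The", "DT"), ("door", "NN"), ("was", "VBD"), ("closed", "VBD")]
fixed = apply_rule(sentence, "VBD", "VBN", "BE_VERB")
print(to_inline(fixed))
```

One rule over a class replaces a separate lexical rule for each of
"is", "was", "were" and so on, which is the kind of clean-up that
becomes possible once such classes exist.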

I have not published the paper (unfortunately it was rejected by a
conference), but I can provide it to people who really want it (it is
still at a development stage; I was thinking of making it public
sometime next year).

The tagger comes with some other functions, such as a sentence splitter,
tokenizer, stemmer, chunker and NE tagger (some of them are not completed
yet; I am also working on implementing a dependency analyzer, parser,
function tagger and regularizer). It supports (or will support) several
formats, including PTB-tree, PTB-bracket, COLLINS parser input format,
MUC format, CoNLL format, (TIPSTER architecture) and (SGML).

The system is called the "OAK system".
You can find an introduction page at
http://cs.nyu.edu/cs/projects/proteus/oak


Satoshi Sekine
sekine at cs.nyu.edu


