Corpora: in-line PoS tagger

Thu Sep 6 14:05:42 UTC 2001

Re: POS tagger

Thank you very much for introducing my parser (Apple Pie Parser).
However, I think its output is not as accurate as that of Brill's tagger.

By the way, I have improved Brill's tagger by hand. I read through his
rules myself and modified them. For example, I introduced classes for
be-verbs, have-verbs, numbers (and more), and cleaned up the rules for
"there", auxiliary verbs, other verbs (and more and more). The accuracy
improved from 96.5% to 97% on test data (and we have evidence that this
is almost the upper limit because of errors in the Penn Treebank, i.e. if
you find a better accuracy, it may be overtrained). It works with
stdin/stdout and files.
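
As a rough illustration of what a class-based rule of this kind might look like, here is a minimal Python sketch. The actual rules and classes are not given in this message, so the word lists, the names word_class and apply_rule, and the example rule (retag VBD as VBN after a have-verb) are all assumptions made for illustration only.

```python
# Hypothetical word classes; the real class definitions are not in this post.
BE_VERBS = {"be", "am", "is", "are", "was", "were", "been", "being"}
HAVE_VERBS = {"have", "has", "had", "having"}


def word_class(word):
    """Return an assumed class label for a word, or None if it has none."""
    w = word.lower()
    if w in BE_VERBS:
        return "BE"
    if w in HAVE_VERBS:
        return "HAVE"
    # Treat strings such as "1,000" or "3.5" as numbers.
    if w.replace(",", "").replace(".", "", 1).isdigit():
        return "NUM"
    return None


def apply_rule(tagged, from_tag, to_tag, prev_class):
    """Brill-style transformation: change from_tag to to_tag when the
    previous word belongs to prev_class."""
    out = list(tagged)
    for i in range(1, len(out)):
        word, tag = out[i]
        if tag == from_tag and word_class(out[i - 1][0]) == prev_class:
            out[i] = (word, to_tag)
    return out


# Illustrative rule: a past-tense tag after a have-verb becomes a participle.
sentence = [("She", "PRP"), ("has", "VBZ"), ("walked", "VBD")]
print(apply_rule(sentence, "VBD", "VBN", "HAVE"))
```

Conditioning on a class rather than on individual words lets one rule cover "has", "have" and "had" at once, which is the kind of generalization the class approach is meant to give.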

I have not published the paper (unfortunately it was rejected by a
conference), but I can provide it to people who really want it (it is
still at a development stage; I was thinking of making it public sometime
next year).

The tagger comes with some other functions, such as a sentence splitter,
tokenizer, stemmer, chunker and NE tagger (some of them are not completed
yet; I am also working on implementing a dependency analyzer, parser,
function tagger and regularizer). It supports (or will support) several
formats including PTB-tree, PTB-bracket, Collins parser input format,
MUC format, CoNLL format, (Tipster architecture) and (SGML).
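
To make one of these formats concrete, the sketch below writes tagged and chunked tokens in a CoNLL-style column layout. It assumes the CoNLL-2000 chunking convention (one token per line as "word POS chunk", with a blank line after each sentence); which CoNLL variant the system actually reads or writes is not stated here, and the function name to_conll is hypothetical.

```python
def to_conll(sentences):
    """Render sentences of (word, pos, chunk) triples as CoNLL-2000 style
    text: one space-separated token per line, blank line after a sentence."""
    lines = []
    for sent in sentences:
        for word, pos, chunk in sent:
            lines.append(f"{word} {pos} {chunk}")
        lines.append("")  # sentence boundary
    return "\n".join(lines)


example = [[("He", "PRP", "B-NP"), ("runs", "VBZ", "B-VP"), (".", ".", "O")]]
print(to_conll(example))
```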

The system is called the "OAK system".
You can find an introduction page at
http://cs.nyu.edu/cs/projects/proteus/oak

Satoshi Sekine
sekine at cs.nyu.edu