[Corpora-List] Current state of the art of POS tagging/evaluation?
Orion Buckminster Montoya
orion at mdcclv.com
Thu May 3 14:22:16 UTC 2007
We are looking to evaluate POS-taggers for English, to establish which
to use for future tagging of the Oxford English Corpus. Taggers we
are aware of, and hope to evaluate, include:
CLAWS
RASP
EngCG
Connexor
TreeTagger
Brill
We would appreciate pointers for any of the following:
* other taggers that we should consider
* papers describing comparative evaluation exercises
* data to use as a 'gold standard': we are aware of the BNC
sampler and the Penn TreeBank, though we are also aware of the roles
these datasets have played as training and development material for
various taggers. The OEC is web-sourced and covers a wide range of
text types, so ideally we would evaluate the taggers on a dataset
like that.
Since the performance of many taggers depends on the quality and
volume of training text, we'd also appreciate pointers on how that can
be brought into the evaluation, to give us a good idea of which
tagger will perform best on our dataset.
I would be particularly pleased to find a top-quality tagger with
freely modifiable source code.
Many thanks; offlist replies will be summarized, but on-list replies
may prove interesting.
--
Orion Montoya
Data & Development Editor
Dictionaries
Oxford University Press