[Corpora-List] Current state of the art of POS tagging/evaluation?

Orion Buckminster Montoya orion at mdcclv.com
Thu May 3 14:22:16 UTC 2007


We are looking to evaluate POS taggers for English, to establish which
to use for future tagging of the Oxford English Corpus (OEC).  Taggers
we are aware of, and hope to evaluate, include:

	CLAWS
	RASP 
	EngCG
	Connexor
	TreeTagger
	Brill 

We would appreciate pointers for any of the following:

	* other taggers that we should consider
	* papers describing comparative evaluation exercises
	* data to use as a 'gold standard': we are aware of the BNC
	  Sampler and the Penn Treebank, though we are also aware of the
	  roles these datasets have played as training and development
	  material for various taggers.  The OEC is web-sourced and covers
	  a wide range of text types, so ideally we would evaluate the
	  taggers on a dataset with a similar range.

Since the performance of many taggers depends on the quality and
volume of their training text, we'd also appreciate pointers on how
that can be factored into the evaluation, to give us a good idea of
which tagger will perform best on our dataset.
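
To make the intended comparison concrete, here is a minimal sketch of
per-token accuracy against a gold standard, broken down by text type.
It assumes each tagger's output has already been aligned
token-for-token with the gold data and mapped to a common tagset; the
function names and the (text_type, gold_tags, predicted_tags) sample
layout are hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def token_accuracy(gold, predicted):
    """Return the fraction of tokens whose predicted tag matches gold.

    Assumes both sequences use the same tagset and tokenization.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must be aligned token-for-token")
    if not gold:
        return 0.0
    correct = sum(1 for g, p in zip(gold, predicted) if g == p)
    return correct / len(gold)


def accuracy_by_text_type(samples):
    """Aggregate token accuracy per text type.

    `samples` is an iterable of (text_type, gold_tags, predicted_tags)
    triples; this layout is an assumption made for the sketch.
    """
    totals = defaultdict(lambda: [0, 0])  # text_type -> [correct, total]
    for text_type, gold, predicted in samples:
        if len(gold) != len(predicted):
            raise ValueError("sample for %r is not aligned" % text_type)
        totals[text_type][0] += sum(1 for g, p in zip(gold, predicted) if g == p)
        totals[text_type][1] += len(gold)
    # Skip text types that contributed no tokens to avoid dividing by zero.
    return {t: c / n for t, (c, n) in totals.items() if n}
```

Running the same functions on taggers retrained with progressively
larger slices of training text would give a rough picture of how
training volume affects each tagger, for those that can be retrained.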

I would be particularly pleased to find a top-quality tagger with
freely modifiable source code.

Many thanks; off-list replies will be summarized for the list, though
on-list replies may prove interesting to others.

--
Orion Montoya
Data & Development Editor
Dictionaries
Oxford University Press
