[Corpora-List] Current state of the art of POS tagging/evaluation?

Rayson, Paul rayson at exchange.lancs.ac.uk
Sun May 6 08:20:27 UTC 2007


Hi,

Presumably you've found our website already, but I've recently placed
online PDF versions of the key references for CLAWS at:

http://www.comp.lancs.ac.uk/ucrel/claws/

On that page are pointers to the detailed error analysis carried out by
Smith and Leech in the BNC manual. 

You should also note the paragraph about the template tagger software,
which was developed in the BNC enhancement project to improve the
accuracy and robustness of CLAWS POS tagging. Nick Smith has been
working on the C8 tagset, which makes further distinctions in the
determiner and pronoun categories, as well as for auxiliary verbs. The
C8 tagset is implemented via the template tagger as a post-processor
to CLAWS output.
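Purely as an illustration of the general idea (the actual template tagger
rules and C8 tag names are not shown here; the tags and rules below are
made up for the example), such a post-processor can be thought of as a set
of context-sensitive rewrite rules that refine a coarse tag into a finer
one:

```python
# Illustrative sketch of a rule-based post-processor that refines a
# coarse tagset into a finer one, in the spirit of deriving C8
# distinctions from CLAWS output. Tag names here are hypothetical,
# not real C8 tags.

def refine_tags(tagged):
    """tagged: list of (token, tag) pairs; returns a new list with
    finer-grained tags substituted where a rule matches."""
    refined = []
    for token, tag in tagged:
        new_tag = tag
        # Hypothetical rule: split a generic determiner tag by lexical class.
        if tag == "DT":
            if token.lower() in {"this", "that", "these", "those"}:
                new_tag = "DT-DEM"   # demonstrative determiner (made-up tag)
            elif token.lower() in {"a", "an", "the"}:
                new_tag = "DT-ART"   # article (made-up tag)
        refined.append((token, new_tag))
    return refined

print(refine_tags([("the", "DT"), ("those", "DT"), ("dogs", "NN2")]))
```

The real template tagger rules are of course richer than this, conditioning
on surrounding context rather than the token alone.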

For CLAWS itself, we have versions for various Unix flavours and
Windows, and are working on an OS X version. Source code could be made
available under licence. Please get in touch off-list for further
information.

Regards,
Paul.

Dr. Paul Rayson
Director of UCREL
Computing Department, Infolab21, South Drive, Lancaster University,
Lancaster, LA1 4WA, UK.
Web: http://www.comp.lancs.ac.uk/computing/users/paul/
Tel: +44 1524 510357 Fax: +44 1524 510492


-----Original Message-----
From: owner-corpora at lists.uib.no [mailto:owner-corpora at lists.uib.no] On
Behalf Of Orion Buckminster Montoya
Sent: 03 May 2007 15:22
To: corpora at uib.no
Subject: [Corpora-List] Current state of the art of POS
tagging/evaluation?

We are looking to evaluate POS-taggers for English, to establish which
to use for future tagging of the Oxford English Corpus.  Taggers we
are aware of, and hope to evaluate, include

	CLAWS
	RASP 
	EngCG
	Connexor
	TreeTagger
	Brill 

We would appreciate pointers for any of the following:

	* other taggers that we should consider
	* papers describing comparative evaluation exercises
	* data to use as a 'gold standard': we are aware of the BNC
Sampler and the Penn Treebank, though we are also aware of the roles
these datasets have played as training and development material for
various taggers.  The OEC is web-sourced and covers a wide range of
text types, so ideally we would evaluate on a similarly varied dataset.

Since tagger performance, for many taggers, depends on the quality and
volume of training text, we would also appreciate pointers on how that
can be brought into the evaluation, to give us a good idea of which
tagger will perform best on our dataset.
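For what it's worth, the core measure in such an evaluation is simply
token-level accuracy against the gold standard. A minimal sketch (assuming
the tagger output has already been token-aligned with the gold standard;
reconciling tokenisation and mapping between tagsets is the hard part and
is not shown):

```python
# Minimal sketch: token-level tagging accuracy against a gold standard.
# Assumes both sequences are already aligned token-for-token.

def tagging_accuracy(gold_tags, predicted_tags):
    """Fraction of tokens whose predicted tag matches the gold tag."""
    if len(gold_tags) != len(predicted_tags):
        raise ValueError("sequences must be token-aligned")
    correct = sum(g == p for g, p in zip(gold_tags, predicted_tags))
    return correct / len(gold_tags)

print(tagging_accuracy(["DT", "NN", "VBZ"], ["DT", "NN", "NNS"]))  # 2 of 3 correct
```

Per-tag confusion counts, as in the Smith and Leech error analysis
mentioned above, are usually more informative than a single accuracy
figure when comparing taggers.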

I would be particularly pleased to find a top-quality tagger with
freely modifiable source code.

Many thanks; off-list replies will be summarized, but on-list replies
may prove interesting.

--
Orion Montoya
Data & Development Editor
Dictionaries
Oxford University Press
