[Corpora-List] Current state of the art of POS tagging/evaluation?
Hans van Halteren
hvh at let.ru.nl
Fri May 4 08:38:45 UTC 2007
At 16:22 3-5-2007, Orion Buckminster Montoya wrote:
>We would appreciate pointers for any of the following:
> * papers describing comparative evaluation exercises
See
Hans van Halteren, Jakub Zavrel, Walter Daelemans:
Improving Accuracy in NLP Through Combination of Machine
Learning Systems. Computational Linguistics 27(2): 199-229 (2001)
You should probably also have a look at chapters 6 and 7 of
Hans van Halteren (ed.), Syntactic Wordclass Tagging, Kluwer, 1999
which deal with evaluation (6) and choosing a tagger for your own use (7).
>Since tagger performance, for many taggers, depends on the quality and
>volume of training text, we'd also appreciate pointers on how that can
>be brought in to the evaluation, to give us a good idea of which
>tagger will perform best on our dataset.
Do you want to evaluate taggers or tagger generators?
> * other taggers that we should consider
I have taggers trained on written/spoken material from the BNC sampler,
as well as the tagger generator with which I made them. For a description, see
H. van Halteren, The Detection of Inconsistency in Manually Tagged Text,
Proc. Workshop on Linguistically Interpreted Corpora 2000 (LINC 2000), 2000
>I would be particularly pleased to find a top-quality tagger with
>freely modifiable source code.
Running the tagger or generator in a comparison is no problem (if you
have access to Linux). For making the source code available, I'd have
to discuss things with the department here.
> * data to use as 'gold standard': we are aware of the BNC
>sampler and the Penn TreeBank, though we are also aware of the roles
>these datasets have played as training and development material, for
>various taggers. The OEC is web-sourced and covers a wide range of
>text types so ideally we shall evaluate it on a dataset like that.
Sounds like you'll have to take a (representative) sample from your own
corpus and tag it by hand. Did you decide on a tagset yet?
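A minimal sketch of what that could look like in practice, assuming the
corpus is available as (text type, sentence) pairs and that the hand-tagged
sample and the tagger output share the same tokenisation. The function
names and the proportional-allocation scheme are illustrative assumptions,
not part of any existing tool:

```python
import random
from collections import defaultdict


def stratified_sample(sentences, n_total, seed=0):
    """Draw roughly n_total sentences, proportionally per text type.

    sentences: list of (text_type, sentence) pairs (hypothetical layout).
    Every text type gets at least one sentence, so the result can be
    slightly larger than n_total when there are many small text types.
    """
    rng = random.Random(seed)
    by_type = defaultdict(list)
    for text_type, sent in sentences:
        by_type[text_type].append(sent)

    sample = []
    for text_type, sents in sorted(by_type.items()):
        # Proportional share of the sample, clamped to [1, len(sents)].
        k = max(1, round(n_total * len(sents) / len(sentences)))
        k = min(k, len(sents))
        sample.extend((text_type, s) for s in rng.sample(sents, k))
    return sample


def token_accuracy(gold, predicted):
    """Fraction of tokens whose predicted tag equals the hand-assigned tag.

    gold, predicted: lists of tag sequences, one sequence per sentence.
    """
    correct = total = 0
    for g, p in zip(gold, predicted):
        if len(g) != len(p):
            raise ValueError("tokenisation mismatch between gold and tagger output")
        correct += sum(a == b for a, b in zip(g, p))
        total += len(g)
    return correct / total if total else 0.0
```

Reporting the accuracy per text type as well as overall would show whether
a tagger does well only on the text types that resemble its training data.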
Keep us informed,
Hans van Halteren