[Corpora-List] POS-tagger maintenance and improvement
Serge Sharoff
s.sharoff at leeds.ac.uk
Wed Feb 25 11:59:18 UTC 2009
My feeling is that this partly stems from the nature of statistical
models, with the same pattern applicable to POS tagging, statistical MT
and Machine Learning. You cannot improve your model by spotting an
individual error. You have to enlarge your corpus or use a different
training algorithm, and pray that this will cure the error in question.
It is not that there is no work on improving POS taggers (see
SVMTagger, Stanford Tagger, hunpos, etc., all of which appeared recently).
They report incremental improvements over the state of the art without
necessarily resolving the problems you spotted.
Serge
On Wed, 2009-02-25 at 11:15 +0000, Adam Kilgarriff wrote:
> All,
>
> My lexicography colleagues and I use POS-tagged corpora all the time,
> every day, and very frequently spot systematic errors. (This is for a
> range of languages, but particularly English.) We would dearly like
> to be in a dialogue with the developers of the POS-tagger and/or the
> relevant language models so the tagger+model could be improved in
> response to our feedback. (We have been using standard models rather
> than training our own.) However it seems, for the taggers and
> language models we use (mainly TreeTagger, also CLAWS) and also for
> other market leaders, all of which seem to be from Universities, the
> developers have little motivation for continuing the improvement of
> their tagger, since incremental improvements do not make for good
> research papers, so there is nowhere for our feedback to go, nor any
> real prospect of these taggers/models improving.
>
> Am I too pessimistic? Are there ways of improving language models
> other than developing bigger and better training corpora - not an
> exercise we have the resources to invest in? Are there commercial
> taggers I should be considering (as, in the commercial world, there is
> motivation for incremental improvements and responding to customer
> feedback)?
>
> Responses and ideas most welcome
>
> Adam Kilgarriff
> --
> ================================================
> Adam Kilgarriff
> http://www.kilgarriff.co.uk
> Lexical Computing Ltd http://www.sketchengine.co.uk
> Lexicography MasterClass Ltd http://www.lexmasterclass.com
> Universities of Leeds and Sussex adam at lexmasterclass.com
> ================================================
>
> _______________________________________________
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora