[Corpora-List] POS-tagger maintenance and improvement

Eckhard Bick eckhard.bick at mail.dk
Wed Feb 25 12:06:55 UTC 2009


Hello,

This is an interesting observation.

One explanation for the lack of response to user feedback may be that 
it is much harder to make incremental changes to probabilistic / 
machine-learned systems than to rule-based ones. If a corpus user 
identifies systematic errors, this can, in a rule-based parser, be used 
directly to fix or add rules, or to introduce new lexical sets and 
categories. In an ML system, the same fix would mean paying somebody to 
annotate the changes into a treebank and then retraining, which is, as 
you say, unlikely.
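
To make the rule-based side of this contrast concrete, here is a minimal 
Python sketch of how one user-reported systematic error could be turned 
into a single hand-written correction rule plus a small lexical set. It 
is not CG or AGFL syntax and not any existing tagger's API: the Penn-style 
tags, the BASE_VERBS set, the Rule class and the example error (verbs 
after "to" mis-tagged as nouns) are all assumptions made up for 
illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Token = Tuple[str, str]  # (word, tag), Penn-style tags assumed

# Hypothetical lexical set, extended whenever users report new cases.
BASE_VERBS = {"access", "email", "google"}


@dataclass
class Rule:
    name: str
    applies: Callable[[List[Token], int], bool]  # context test at position i
    new_tag: str


def noun_after_to(tokens: List[Token], i: int) -> bool:
    """True if token i is tagged NN, follows a TO, and is in the verb set."""
    word, tag = tokens[i]
    return (
        i > 0
        and tokens[i - 1][1] == "TO"
        and tag == "NN"
        and word.lower() in BASE_VERBS
    )


# Fixing a newly reported error means appending one more rule here.
RULES = [Rule("noun-after-to", noun_after_to, "VB")]


def correct(tokens: List[Token], rules: List[Rule] = RULES) -> List[Token]:
    """Apply the first matching rule at each position; leave the rest alone."""
    out = list(tokens)
    for i in range(len(out)):
        for rule in rules:
            if rule.applies(out, i):
                out[i] = (out[i][0], rule.new_tag)
                break
    return out


tagged = [("They", "PRP"), ("want", "VBP"), ("to", "TO"),
          ("email", "NN"), ("us", "PRP")]
print(correct(tagged))
```

The point of the sketch is only that the fix is local and inspectable: one 
rule and one set entry, with no re-annotation or retraining step. A real 
rule would need a tighter context, since "to" can also be a preposition 
before a genuine noun.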

Though my view is probably biased, I think this might be an example of 
the side-effects of using trained systems for corpus work rather than 
rule-based ones (like AGFL or CG, to name a couple).

Best regards,
Eckhard Bick

Adam Kilgarriff wrote:
> All,
>  
> My lexicography colleagues and I use POS-tagged corpora all the time, 
> every day, and very frequently spot systematic errors.  (This is for a 
> range of languages, but particularly English.)   We would dearly like 
> to be in a dialogue with the developers of the POS-tagger and/or the 
> relevant language models so the tagger+model could be improved in 
> response to our feedback. (We have been using standard models rather 
> than training our own.)   However it seems, for the taggers and 
> language models we use (mainly TreeTagger, also CLAWS) and also for 
> other market leaders, all of which seem to be from Universities, the 
> developers have little motivation for continuing the improvement of 
> their tagger, since incremental improvements do not make for good 
> research papers, so there is nowhere for our feedback to go, nor any 
> real prospect of these taggers/models improving.
>  
> Am I too pessimistic?  Are there ways of improving language models 
> other than developing bigger and better training corpora - not an 
> exercise we have the resources to invest in?  Are there commercial 
> taggers I should be considering (as, in the commercial world, there is 
> motivation for incremental improvements and responding to customer 
> feedback)?
> Responses and ideas most welcome
>  
> Adam Kilgarriff
> -- 
> ================================================
> Adam Kilgarriff                                     
>  http://www.kilgarriff.co.uk              
> Lexical Computing Ltd                   http://www.sketchengine.co.uk
> Lexicography MasterClass Ltd      http://www.lexmasterclass.com
> Universities of Leeds and Sussex       adam at lexmasterclass.com 
> <mailto:adam at lexmasterclass.com>
> ================================================
> ------------------------------------------------------------------------
>
> _______________________________________________
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
>   


-- 
Eckhard Bick,
cand.med., dr.phil.
University of Southern Denmark
e-mail: eckhard.bick at mail.dk
web: http://beta.visl.sdu.dk

