[Corpora-List] POS-tagger maintenance and improvement

Dale Gerdemann dg at sfs.nphil.uni-tuebingen.de
Thu Feb 26 08:41:41 UTC 2009


Hello all,

It's the same as with technical books. Many authors conscientiously keep 
an errata list on their web page. The better authors make a second printing 
or a second edition of the book. But in the real world, not every author 
is going to do this. So if the book is important enough to you, then 
you'd better make your own errata.

If the errors that your tagger makes are systematic enough, then I 
suppose that the corrections could be applied with a transducer. If you 
think this is feasible, then collect a bunch of your errors and send 
them to me, and I will see if I can get a student interested in working 
on the problem. Starting in April, I'm teaching a B.A. thesis seminar on 
finite state methods in NLP, and all of my students need projects. So 
there's a good chance that one of them would be interested in your problem.
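
To make the idea concrete, here is a minimal sketch of what such a 
correction layer could look like. It is plain Python that applies 
context-sensitive rewrite rules to tagger output, not a compiled 
finite-state transducer, and the rule shown (a "to" tagged IN before a 
base-form verb is retagged TO, using Penn Treebank tag names) is an 
invented example, not an error observed in TreeTagger or CLAWS. The 
names Rule and apply_rules are hypothetical as well.

```python
from typing import List, NamedTuple, Optional, Tuple

Token = Tuple[str, str]  # (word, tag) pair as produced by some tagger


class Rule(NamedTuple):
    """One hypothetical correction: retag a token in a given right context."""
    word: Optional[str]      # lowercase word form to match, or None for any word
    wrong_tag: str           # tag the tagger assigned
    next_tag: Optional[str]  # tag required on the following token, or None
    right_tag: str           # tag to substitute


def apply_rules(tokens: List[Token], rules: List[Rule]) -> List[Token]:
    """Apply the first matching rule to each token, scanning left to right."""
    out = list(tokens)
    for i, (word, tag) in enumerate(out):
        following = out[i + 1][1] if i + 1 < len(out) else None
        for rule in rules:
            if rule.word is not None and rule.word != word.lower():
                continue
            if rule.wrong_tag != tag:
                continue
            if rule.next_tag is not None and rule.next_tag != following:
                continue
            out[i] = (word, rule.right_tag)
            break  # at most one correction per token
    return out


# Invented example rule: "to" tagged IN before a base-form verb becomes TO.
RULES = [Rule(word="to", wrong_tag="IN", next_tag="VB", right_tag="TO")]

tagged = [("We", "PRP"), ("want", "VBP"), ("to", "IN"), ("go", "VB")]
corrected = apply_rules(tagged, RULES)
```

Rules of this shape (a word, a wrong tag, and a one-token context) are 
the kind that could later be compiled into a real transducer, which is 
why a list of collected errors is the useful starting point.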

Best Regards,

Dale Gerdemann


Adam Kilgarriff wrote:
> All,
>  
> My lexicography colleagues and I use POS-tagged corpora all the time, 
> every day, and very frequently spot systematic errors.  (This is for a 
> range of languages, but particularly English.)   We would dearly like 
> to be in a dialogue with the developers of the POS-tagger and/or the 
> relevant language models so the tagger+model could be improved in 
> response to our feedback. (We have been using standard models rather 
> than training our own.)   However it seems, for the taggers and 
> language models we use (mainly TreeTagger, also CLAWS) and also for 
> other market leaders, all of which seem to be from Universities, the 
> developers have little motivation for continuing the improvement of 
> their tagger, since incremental improvements do not make for good 
> research papers, so there is nowhere for our feedback to go, nor any 
> real prospect of these taggers/models improving.
>  
> Am I too pessimistic?  Are there ways of improving language models 
> other than developing bigger and better training corpora - not an 
> exercise we have the resources to invest in?  Are there commercial 
> taggers I should be considering (as, in the commercial world, there is 
> motivation for incremental improvements and responding to customer 
> feedback)?
> Responses and ideas most welcome.
>  
> Adam Kilgarriff
> -- 
> ================================================
> Adam Kilgarriff                       http://www.kilgarriff.co.uk
> Lexical Computing Ltd                   http://www.sketchengine.co.uk
> Lexicography MasterClass Ltd      http://www.lexmasterclass.com
> Universities of Leeds and Sussex       adam at lexmasterclass.com
> ================================================
> ------------------------------------------------------------------------
>
> _______________________________________________
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
>   
