[Corpora-List] POS-tagger maintenance and improvement

Andras Kornai andras at kornai.com
Wed Feb 25 22:53:44 UTC 2009


Serge,

I can't speak for the others but certainly hunpos/hunmorph/hunspell
and the other hun* tools are very open to user contributions, be they
algorithmic, lexical, or just bug reports of any sort. It is the very 
nature of trainable tools that they take on the error pattern of the 
training corpora, and we have seen many reports of people hand-correcting 
the training data they are working with. For example, Mikheev (2002) writes:

"We found quite a few infelicities in the original [WSJ corpus]
tokenization and tagging, however, which we had to correct by hand"

and we have the same experience with most corpora we use, including
our own. Creating some kind of clearinghouse or feedback mechanism for
manual corrections, clever postprocessing hacks etc. would certainly
have value, as long as these contributions don't carry restrictive
licensing. There is a minefield here: the SVMTools and the hun* tools
are LGPL (meaning that industry is welcome to participate) while the
Stanford tools are GPL, which explicitly forbids incorporation in
proprietary software. So if you want to send corrections, make sure
they are licensed under the LGPL.

Andras Kornai

PS. Historically, the NLP community used a "give credit but otherwise
do what you will" license, and the habit of sharing critical material
(e.g. Henry Spencer's freely redistributable regex(3) or Jorge
Stolfi's original set of dictionaries) predates the Free Software
movement.  Originally, the emphasis was very much on making sure
nothing proprietary creeps in, so when the FSF tried to fork ispell
(the precursor of hunspell) this was very strongly resisted by the
creators who saw it as an obstacle to truly free use.  I personally
believe that part of the reason why, in Chris Dyer's words,

"the corpora/NLP community, unlike the software community and
free-encyclopedia communities, has failed to benefit from the "bazaar"
(bizarre?) model of open collaboration"

is that the GPL basically stands in the way of industry-academia
partnerships, FSF claims to the contrary notwithstanding. 



On Wed, Feb 25, 2009 at 11:59:18AM +0000, Serge Sharoff wrote:
> My feeling is that this partly stems from the nature of statistical
> models, with the same pattern applicable to POS tagging, statistical MT
> and Machine Learning.  You cannot improve your model by spotting an
> individual error.  You have to enlarge your corpus or use a different
> training algorithm, and pray that this will cure the error in question.
> It is not that there are no works on improving POS taggers (see
> SVMTagger, Stanford Tagger, hunpos, etc all appearing recently).  They
> report some incremental improvements over the state of the art without
> necessarily resolving the problems you spotted. 
>  
> Serge 
> 
> On Wed, 2009-02-25 at 11:15 +0000, Adam Kilgarriff wrote:
> > All,
> >  
> > My lexicography colleagues and I use POS-tagged corpora all the time,
> > every day, and very frequently spot systematic errors.  (This is for a
> > range of languages, but particularly English.)   We would dearly like
> > to be in a dialogue with the developers of the POS-tagger and/or the
> > relevant language models so the tagger+model could be improved in
> > response to our feedback. (We have been using standard models rather
> > than training our own.)   However it seems, for the taggers and
> > language models we use (mainly TreeTagger, also CLAWS) and also for
> > other market leaders, all of which seem to be from Universities, the
> > developers have little motivation for continuing the improvement of
> > their tagger, since incremental improvements do not make for good
> > research papers, so there is nowhere for our feedback to go, nor any
> > real prospect of these taggers/models improving.
> >  
> > Am I too pessimistic?  Are there ways of improving language models
> > other than developing bigger and better training corpora - not an
> > exercise we have the resources to invest in?  Are there commercial
> > taggers I should be considering (as, in the commercial world, there is
> > motivation for incremental improvements and responding to customer
> > feedback)?
> > 
> > Responses and ideas most welcome
> >  
> > Adam Kilgarriff
> > -- 
> > ================================================
> > Adam Kilgarriff
> >  http://www.kilgarriff.co.uk              
> > Lexical Computing Ltd                   http://www.sketchengine.co.uk
> > Lexicography MasterClass Ltd      http://www.lexmasterclass.com
> > Universities of Leeds and Sussex       adam at lexmasterclass.com
> > ================================================
> > 
> > _______________________________________________
> > Corpora mailing list
> > Corpora at uib.no
> > http://mailman.uib.no/listinfo/corpora
> 
