[Corpora-List] POS-tagger maintenance and improvement

Thu Feb 26 09:37:55 UTC 2009

Hi Marco, Adam and all,

I should also mention that we at Linguateca, for Portuguese, have been using Eckhard Bick's PALAVRAS parser (VISL) ever since 1999. We have often reported errors and bugs, which he has always corrected, thereby steadily improving the parser over the years.

So, my experience is quite opposed to Adam's as well -- and this in the context of a research project, no commercial advantages AFAIK :-)

But it appears to me that it is utterly naive to expect that a bunch of people will help to create or improve a consistent and high-quality resource as far as linguistic analysis is concerned. We have tried (together with Eckhard Bick) to develop a treebank for Portuguese along these lines, Floresta Sintáctica www.linguateca.pt/Floresta, and got no volunteers (so it had to be built by project members alone). 

Likewise, in the context of many other Linguateca projects concerned with making (annotated) resources publicly available, people may occasionally report problems (which we try to correct), but more often than not their "problems" simply represent different linguistic views on particular phenomena.

In my opinion, this is the real bottleneck in getting large annotated texts -- agreement on text analysis is not trivial, and as already mentioned may depend on the application/goal.

Let me also raise the following issue:

"POS tagger" is definitely a misnomer. Most POS taggers for English mark much more than part-of-speech, and most taggers (for example the CG taggers of Eckhard Bick) do much more than POS analysis: in fact, they do a full-fledged syntactic analysis; they just mark it per word (being dependency-based). There are also semantic taggers etc. ... you can tag whatever you are interested in!

So, I suggest that we use the words "tagger" vs. "parser" as far as the output format goes (a tagger does not add non-terminals; a parser does, or rather explicitly marks structure), and forget about POS, since nothing in this discussion is actually specific to POS: it applies to any linguistic annotation of text.
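
To make the output-format distinction concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the sentence, the tag names (DET, N, V, subj, ...) and the data layout are hypothetical and are not the actual output of PALAVRAS, the CG taggers, the TreeTagger or any other tool. It only shows that tagger-style output attaches all information, including syntactic function and head, to the words themselves, while parser-style output introduces extra non-terminal nodes.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TaggedToken:
    """Tagger-style record: one per word, no additional nodes."""
    form: str
    pos: str        # part of speech (hypothetical tagset)
    function: str   # syntactic function marked on the word itself
    head: int       # 1-based index of the head word, 0 for the root


# Tagger-style output: a flat list, exactly one entry per token.
tagged: List[TaggedToken] = [
    TaggedToken("The", "DET", "det", 2),
    TaggedToken("dog", "N", "subj", 3),
    TaggedToken("barks", "V", "root", 0),
]

# Parser-style output: a nested (label, children) tree in which
# S, NP and VP are non-terminals that correspond to no single word.
parsed = (
    "S",
    [
        ("NP", [("DET", ["The"]), ("N", ["dog"])]),
        ("VP", [("V", ["barks"])]),
    ],
)


def leaves(tree) -> List[str]:
    """Return the words of a (label, children) tree, left to right."""
    if isinstance(tree, str):
        return [tree]
    _, children = tree
    return [word for child in children for word in leaves(child)]


def count_phrase_nodes(tree) -> int:
    """Count nodes that dominate other nodes rather than a bare word."""
    if isinstance(tree, str):
        return 0
    _, children = tree
    if all(isinstance(child, str) for child in children):
        return 0  # a word-level (POS) node, not a phrase node
    return 1 + sum(count_phrase_nodes(child) for child in children)


print(len(tagged), "tagged tokens")
print(count_phrase_nodes(parsed), "phrase-level nodes added by the parse")
```

Both structures cover the same three words; the difference lies only in whether the structure is marked on the words or through added nodes.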

BTW, I think the first instance of POS tagger in the literature was Ken Church's, as a first step towards full parsing of English. 

Church, Kenneth Ward. "A stochastic Parts Program and Noun Phrase Parser for Unrestricted Text", Proceedings of the Second Conference on Applied Natural Language Processing (ACL), 1988, pp.136-43.

Is this right? I would like to be corrected if this is not the case and this expression/term was coined by someone else. In fact, he also calls it a "parts program", as you can see from the title.

Diana

> -----Original Message-----
> From: corpora-bounces at uib.no [mailto:corpora-bounces at uib.no] 
> On Behalf Of Marco Baroni
> Sent: 26. februar 2009 09:41
> To: Helmut Schmid
> Cc: Corpora List; Sue Atkins; Valerie GRUNDY; Patrick Hanks
> Subject: Re: [Corpora-List] POS-tagger maintenance and improvement
> 
> And, as a happy user of the TreeTagger, I would like to emphasize that
> whenever we updated our Italian training corpus and lexicon, Helmut
> has always retrained the model and posted the new version on the site
> within a span of a few days, so my experience has been very different
> from the one described by Adam, in this respect.
> 
> Regards,
> 
> Marco
> 
> Helmut Schmid wrote:
> > Hi Adam,
> > 
> > as the developer of the TreeTagger, I would like to emphasize that I am
> > still maintaining this software and that any feedback and suggestions
> > for improvements are highly welcome! I am also very interested in
> > collaborations for training the TreeTagger on new languages.
> > 
> > Best regards,
> >   Helmut Schmid
> 
> 
> _______________________________________________
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
> 