[Corpora-List] POS-tagger maintenance and improvement

Chris Dyer redpony at umd.edu
Wed Feb 25 12:09:08 UTC 2009


I think Adam brings up an interesting point.  It is certainly the case
that the corpora/NLP community, unlike the software community and
free-encyclopedia communities, has failed to benefit from the "bazaar"
(bizarre?) model of open collaboration that has produced such
successes as Linux and Wikipedia.  This may be an unavoidable
situation for a variety of reasons--for example, most useful corpora
contain copyrighted material, and most NLP software is generated as a
research effort.  But, I do wonder if a grassroots effort (say, we
propose a model that would enable incremental improvements to corpora,
models, and software) might be able to convince LDC, for example, to
consider hosting a facility for enabling community updates to widely
used resources.

Chris

On Wed, Feb 25, 2009 at 11:15 AM, Adam Kilgarriff
<adam at lexmasterclass.com> wrote:
> All,
>
> My lexicography colleagues and I use POS-tagged corpora all the time, every
> day, and very frequently spot systematic errors.  (This is for a range of
> languages, but particularly English.)   We would dearly like to be in a
> dialogue with the developers of the POS-tagger and/or the relevant language
> models so the tagger+model could be improved in response to our
> feedback. (We have been using standard models rather than training our
> own.)   However, it seems, for the taggers and language models we use (mainly
> TreeTagger, also CLAWS) and also for other market leaders, all of which seem
> to be from Universities, the developers have little motivation for
> continuing the improvement of their tagger, since
> incremental improvements do not make for good research papers, so there is
> nowhere for our feedback to go, nor any real prospect of these
> taggers/models improving.
>
> Am I too pessimistic?  Are there ways of improving language models other
> than developing bigger and better training corpora - not an exercise we have
> the resources to invest in?  Are there commercial taggers I should be
> considering (as, in the commercial world, there is motivation for
> incremental improvements and responding to customer feedback)?
> Responses and ideas most welcome
>
> Adam Kilgarriff
> --
> ================================================
> Adam Kilgarriff
>  http://www.kilgarriff.co.uk
> Lexical Computing Ltd                   http://www.sketchengine.co.uk
> Lexicography MasterClass Ltd      http://www.lexmasterclass.com
> Universities of Leeds and Sussex       adam at lexmasterclass.com
> ================================================
>
> _______________________________________________
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
>
>



