[Corpora-List] POS-tagger maintenance and improvement

Eric Atwell eric at comp.leeds.ac.uk
Wed Feb 25 13:48:54 UTC 2009


Adam,

Majdi Sawalha here at Leeds has volunteered to investigate how easy it is 
to train TreeTagger for Arabic; and then, if this works, how he might
make use of any feedback you might have on systematic errors. However, I
fear this may not be practicable: (i) the Treetagger model may not work 
for Arabic, and (ii) the model is corpus-derived and so may not be 
"tweakable" to deal with systematic errors.  I *think* the underlying
TreeTagger model uses a lexicon and suffix-list to assign one or more
possible PoS-tags to each word, then uses a decision-tree (trained on a
tagged corpus) to select the best tag compatible with context. 
BUT Arabic has complex morphology, and a PoS-tag is a bundle of features
derived from a bundle of morphemes; many words will not appear in a
corpus-derived lexicon, and suffix alone will only be a partial clue to 
full PoS-tag feature-set. Also, because of the complex morphology, there
are a very large number of possible feature-combinations leading to a
large PoS-tagset, so even the decision-tree model needs a very large
training corpus to avoid training data sparseness.
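
To make the model concrete, here is a toy Python sketch of the pipeline
described above: a corpus-derived lexicon plus suffix-list fallback to
propose candidate PoS-tags, then a contextual scorer (standing in for the
trained decision tree) to pick the best tag. All the data, names, and the
trigram-table stand-in are hypothetical illustrations, not TreeTagger's
actual implementation or file formats.

```python
# Toy sketch of a TreeTagger-style pipeline (illustrative only).

# Step 1: corpus-derived lexicon mapping words to candidate tags.
LEXICON = {
    "the": ["DET"],
    "dog": ["NN"],
    "barks": ["VBZ", "NNS"],
}

# Step 2: suffix list used for words missing from the lexicon.
SUFFIX_TAGS = [("ing", ["VBG"]), ("ed", ["VBD", "VBN"]), ("s", ["NNS", "VBZ"])]

def candidate_tags(word):
    """Assign possible PoS-tags via the lexicon, falling back on suffixes."""
    if word in LEXICON:
        return LEXICON[word]
    for suffix, tags in SUFFIX_TAGS:
        if word.endswith(suffix):
            return tags
    return ["NN"]  # open-class default for unknown words

# Step 3: stand-in for the corpus-trained decision tree: given the two
# preceding tags, score each candidate tag (a toy trigram table here).
TRIGRAM_SCORE = {
    ("DET", "NN", "VBZ"): 0.9,
    ("DET", "NN", "NNS"): 0.1,
}

def tag_sentence(words):
    """Tag left to right, choosing the highest-scoring candidate in context."""
    tags = []
    for word in words:
        context = (tags[-2] if len(tags) > 1 else "BOS",
                   tags[-1] if tags else "BOS")
        best = max(candidate_tags(word),
                   key=lambda t: TRIGRAM_SCORE.get(context + (t,), 0.0))
        tags.append(best)
    return tags
```

The point of the sketch is where it breaks for Arabic: with complex
morphology the "tag" is really a large feature bundle, so both the lexicon
and the suffix table would need vastly more entries, and the contextual
table becomes sparse without a very large training corpus.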

As others have commented, TreeTagger models for other languages 
are also derived from a PoS-tagged corpus, which suggests the only way to 
eradicate systematic errors is to "correct" the tagging in the training
corpus, or perhaps to use a different corpus altogether.


Eric Atwell, Leeds University


On Wed, 25 Feb 2009, Adam Kilgarriff wrote:

> All,
>  
> My lexicography colleagues and I use POS-tagged corpora all the time,
> every day, and very frequently spot systematic errors.  (This is for a
> range of languages, but particularly English.)   We would dearly like to
> be in a dialogue with the developers of the POS-tagger and/or the
> relevant language models so the tagger+model could be improved in
> response to our feedback. (We have been using standard models rather than
> training our own.)   However, it seems that for the taggers and language
> models we use (mainly TreeTagger, also CLAWS), and for other market
> leaders, all of which seem to come from universities, the developers have
> little motivation to keep improving their tagger, since incremental
> improvements do not make for good research papers; so there is nowhere
> for our feedback to go, nor any real prospect of these taggers/models
> improving.
>  
> Am I too pessimistic?  Are there ways of improving language models other
> than developing bigger and better training corpora - not an exercise we
> have the resources to invest in?  Are there commercial taggers I should
> be considering (as, in the commercial world, there is motivation for
> incremental improvements and responding to customer feedback)?
> Responses and ideas are most welcome.
>  
> Adam Kilgarriff
> --
> ================================================
> Adam Kilgarriff                                    
>  http://www.kilgarriff.co.uk              
> Lexical Computing Ltd                   http://www.sketchengine.co.uk
> Lexicography MasterClass Ltd      http://www.lexmasterclass.com
> Universities of Leeds and Sussex       adam at lexmasterclass.com
> ================================================
> 
>

-- 
Eric Atwell,
  Senior Lecturer, Language research group, School of Computing,
  Faculty of Engineering, UNIVERSITY OF LEEDS, Leeds LS2 9JT, England
  TEL: 0113-3435430  FAX: 0113-3435468  WWW/email: google Eric Atwell
_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora

