[Corpora-List] POS-tagger maintenance and improvement
Eric Atwell
eric at comp.leeds.ac.uk
Wed Feb 25 13:48:54 UTC 2009
Adam,
Majdi Sawalha here at Leeds has volunteered to investigate how easy it is
to train TreeTagger for Arabic and then, if this works, how he might
make use of any feedback you might have on systematic errors. However, I
fear this may not be practicable: (i) the TreeTagger model may not work
for Arabic, and (ii) the model is corpus-derived and so may not be
"tweakable" to deal with systematic errors. I *think* the underlying
TreeTagger model uses a lexicon and suffix-list to assign one or more
possible PoS-tags to each word, then uses a decision-tree (trained on a
tagged corpus) to select the best tag compatible with context.
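In rough Python, that two-stage idea might look like the sketch below; the
lexicon entries, suffix rules and the scoring function standing in for the
trained decision tree are all invented for illustration, not TreeTagger's
actual code or data.

# Sketch only: lexicon + suffix lookup, then context-based disambiguation.
# The data and the scorer are placeholders, not TreeTagger's.

LEXICON = {"walks": ["VBZ", "NNS"], "the": ["DT"], "dog": ["NN"]}
SUFFIX_TAGS = [("ing", ["VBG"]), ("ed", ["VBD", "VBN"]), ("s", ["NNS", "VBZ"])]

def candidate_tags(word):
    """Stage 1: look the word up in the lexicon; fall back on its suffix."""
    if word in LEXICON:
        return LEXICON[word]
    for suffix, tags in SUFFIX_TAGS:
        if word.endswith(suffix):
            return tags
    return ["NN"]  # last-resort default

def score(prev_tag, tag):
    """Stage 2 placeholder: a trained decision tree would estimate the
    probability of a tag given its context; here one preference is hard-coded."""
    return 1.0 if (prev_tag, tag) == ("DT", "NN") else 0.5

def tag(sentence):
    prev, output = "<s>", []
    for word in sentence:
        best = max(candidate_tags(word), key=lambda t: score(prev, t))
        output.append((word, best))
        prev = best
    return output

print(tag(["the", "dog", "walks"]))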
BUT Arabic has complex morphology, and a PoS-tag is a bundle of features
derived from a bundle of morphemes; many words will not appear in a
corpus-derived lexicon, and the suffix alone gives only a partial clue to
the full PoS-tag feature set. Also, because of the complex morphology, there
are a very large number of possible feature-combinations leading to a
large PoS-tagset, so even the decision-tree model needs a very large
training corpus to avoid training data sparseness.
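To give a feel for the scale, a back-of-the-envelope calculation with purely
illustrative feature-value counts (the real Arabic figures will differ) shows
how quickly the feature bundles multiply:

# Illustrative only: hypothetical numbers of values per morphological feature.
features = {
    "core PoS": 20, "gender": 2, "number": 3, "person": 3,
    "case": 3, "definiteness": 2, "aspect/mood": 4,
}

tagset_size = 1
for n in features.values():
    tagset_size *= n

print(tagset_size)  # 20*2*3*3*3*2*4 = 8640 possible feature bundles

Even if many of those combinations never occur in practice, a tagset running
into the thousands means each tag is seen only a handful of times in a
training corpus of ordinary size.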
As others have commented, TreeTagger models for other languages
are also derived from a PoS-tagged corpus, which suggests the only way to
eradicate systematic errors is to "correct" the tagging in the training
corpus, or perhaps to use a different corpus altogether.
Eric Atwell, Leeds University
On Wed, 25 Feb 2009, Adam Kilgarriff wrote:
> All,
>
> My lexicography colleagues and I use POS-tagged corpora all the time,
> every day, and very frequently spot systematic errors. (This is for a
> range of languages, but particularly English.) We would dearly like to
> be in a dialogue with the developers of the POS-tagger and/or the
> relevant language models so the tagger+model could be improved in
> response to our feedback. (We have been using standard models rather than
> training our own.) However, it seems that for the taggers and language
> models we use (mainly TreeTagger, also CLAWS), and for the other market
> leaders, all of which seem to come from universities, the developers have
> little motivation to keep improving their taggers, since incremental
> improvements do not make for good research papers. So there is nowhere
> for our feedback to go, nor any real prospect of these
> taggers/models improving.
>
> Am I too pessimistic? Are there ways of improving language models other
> than developing bigger and better training corpora - not an exercise we
> have the resources to invest in? Are there commercial taggers I should
> be considering (as, in the commercial world, there is motivation for
> incremental improvements and responding to customer feedback)?
> Responses and ideas are most welcome.
>
> Adam Kilgarriff
> --
> ================================================
> Adam Kilgarriff
> http://www.kilgarriff.co.uk
> Lexical Computing Ltd http://www.sketchengine.co.uk
> Lexicography MasterClass Ltd http://www.lexmasterclass.com
> Universities of Leeds and Sussex adam at lexmasterclass.com
> ================================================
>
>
--
Eric Atwell,
Senior Lecturer, Language research group, School of Computing,
Faculty of Engineering, UNIVERSITY OF LEEDS, Leeds LS2 9JT, England
TEL: 0113-3435430 FAX: 0113-3435468 WWW/email: google Eric Atwell