[Corpora-List] POS-tagger maintenance and improvement
Brett Reynolds
brett at forsyths.ca
Wed Feb 25 16:20:22 UTC 2009
On 25-Feb-09, at 8:48 AM, Eric Atwell wrote:
> As others have commented, TreeTagger models for other languages are
> also derived from a PoS-tagged corpus, which suggests the only way
> to eradicate systematic errors is to "correct" the tagging in the
> training corpus
I'm a language teacher who dabbles in a variety of things including
linguistics and corpora. I'm an autodidact and don't have any great
expertise in any of these fields, so the following may be completely
obvious to everyone here, or it may be way off the mark:
It seems to me that an inconsistent grammatical description will lead
to inconsistent hand tagging, which, when used to train software,
will lead to inconsistent taggers. The more rigorous our grammar is,
the better our taggers will perform.
To take an English example, a recent paper in Language Learning
referred to "last Sunday" as an adverb in the sentence "He painted
his house last Sunday." This confuses the function of the phrase
(modifier) with its category (NP). If this is the kind of input the
software gets for training, well, GIGO (garbage in, garbage out).
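
As a rough illustration of how such inconsistency might be spotted in
a hand-tagged corpus, here is a minimal Python sketch. It assumes the
corpus is available as lists of (word, tag) pairs; the function name,
the toy sentences, and the tag labels (ADJ, ADV, NOUN, and so on) are
all hypothetical and not taken from any particular tagset or tool. It
collects every word n-gram that has been given more than one tag
sequence.

```python
from collections import defaultdict


def find_inconsistent_taggings(tagged_sentences, n=2):
    """Return word n-grams that received more than one tag sequence.

    tagged_sentences: iterable of sentences, each a list of (word, tag).
    The result maps a lowercased word n-gram to the set of distinct
    tag sequences observed for it.
    """
    seen = defaultdict(set)
    for sentence in tagged_sentences:
        for i in range(len(sentence) - n + 1):
            window = sentence[i:i + n]
            words = tuple(word.lower() for word, _ in window)
            tags = tuple(tag for _, tag in window)
            seen[words].add(tags)
    # Keep only the n-grams tagged in at least two different ways.
    return {words: tags for words, tags in seen.items() if len(tags) > 1}


# Hypothetical toy corpus: "last Sunday" tagged by category in one
# sentence and by function (as if it were an adverb) in the other.
corpus = [
    [("He", "PRON"), ("painted", "VERB"), ("his", "DET"),
     ("house", "NOUN"), ("last", "ADJ"), ("Sunday", "NOUN")],
    [("She", "PRON"), ("left", "VERB"),
     ("last", "ADV"), ("Sunday", "ADV")],
]

conflicts = find_inconsistent_taggings(corpus)
for words, tag_sequences in conflicts.items():
    print(" ".join(words), sorted(tag_sequences))
```

An n-gram that shows up with two tag sequences is not necessarily an
error, since some strings are genuinely ambiguous, so the output is
best read as a list of candidates for a human annotator to review.
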
English has the benefit of an analysis like the Cambridge Grammar of
the English Language (CGEL), which may not be a perfectly accurate
description of English, but seems to me head and shoulders above any
other comprehensive grammar of English. I imagine that an
English POS tagger trained on CGEL-based tagsets would immediately
outperform those based on other grammars. I'm not familiar with
comprehensive grammars of other languages, but I'd guess they are
plagued with inconsistencies.
For all languages, formal linguists, corpus linguists, corpus
builders, and software developers do need to be in constant
interaction. An open source project would seem a good way to
facilitate this, but how do we make sure there's the payback in terms
of academic credentials/publishing credit? (An interesting
tangentially-related discussion is here:
<http://worthwhile.typepad.com/worthwhile_canadian_initi/2009/02/economics-blogging-and-academia.html>)
Best,
Brett
<http://english-jack.blogspot.com>
-----------------------
Brett Reynolds
English Language Centre
Humber College Institute of Technology and Advanced Learning
Toronto, Ontario, Canada
brett.reynolds at humber.ca
_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora