[Corpora-List] POS-tagger maintenance and improvement

Brett Reynolds brett at forsyths.ca
Wed Feb 25 16:20:22 UTC 2009


On 25-Feb-09, at 8:48 AM, Eric Atwell wrote:
> As others have commented, TreeTagger models for other languages are
> also derived from a PoS-tagged corpus, which suggests the only way
> to eradicate systematic errors is to "correct" the tagging in the
> training corpus

I'm a language teacher who dabbles in a variety of things including  
linguistics and corpora. I'm an autodidact and don't have any great  
expertise in any of these fields, so the following may be completely  
obvious to everyone here, or it may be way off the mark:

It seems to me that an inconsistent grammatical description will lead  
to inconsistent hand tagging, which, when used to train software,  
will lead to inconsistent taggers. The more rigorous our grammar is,  
the better our taggers will perform.

To take an English example, there was a recent paper in Language  
Learning that referred to "last Sunday" as an adverb in the sentence  
"He painted his house last Sunday." This confuses the function of the  
phrase (modifier) with its category (NP). If this is the kind of input  
the software has for training, well, GIGO.

English has the benefit of an analysis like the Cambridge Grammar of  
the English Language (CGEL), which may not be a perfectly accurate  
description of English but seems to me head and shoulders above any  
other comprehensive grammar of English. I imagine that an English POS  
tagger trained on a corpus annotated with a CGEL-based tagset would  
immediately outperform those based on other grammars. I'm not familiar  
with comprehensive grammars of other languages, but I'd guess they are  
plagued with inconsistencies.

For all languages, formal linguists, corpus linguists, corpus  
builders, and software developers do need to be in constant  
interaction. An open source project would seem a good way to  
facilitate this, but how do we make sure contributors get a payback in  
terms of academic credentials or publishing credit? (An interesting  
tangentially related discussion is here:
<http://worthwhile.typepad.com/worthwhile_canadian_initi/2009/02/economics-blogging-and-academia.html>)

Best,
Brett

<http://english-jack.blogspot.com>

-----------------------
Brett Reynolds
English Language Centre
Humber College Institute of Technology and Advanced Learning
Toronto, Ontario, Canada
brett.reynolds at humber.ca




_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora


