[Corpora-List] POS-tagger maintenance and improvement

WHITELOCK, Pete pete.whitelock at oup.com
Thu Feb 26 01:04:33 UTC 2009


Hi Brett,

There are obvious limitations in trying to shoehorn the behaviour of
words and phrases into the straitjacket of a single atomic symbol. What
should be obvious, though, is that for the purposes of tagging, such
atomic symbols should reflect the distributional characteristics of the
items they label, and not anything else such as their function or
morphology, because distributions are what taggers are trained on and
what they are intended to classify. In this regard, the labeling of
"last Sunday" as an adverb seems eminently sensible, since its
distribution is precisely that of a (temporal) adverb rather than that
of an arbitrary noun phrase. Tagged as an ordinary noun phrase, it would
leave "painted" followed by two NPs, and we wouldn't want to consider
"paint" as a ditransitive verb in the sentence "He painted his house
last Sunday". I would expect a tagger that assigned "last Sunday" the
same tag as "yesterday" to out-perform one that called it an
adjective-noun sequence.
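
To make "same distribution" concrete, here is a minimal sketch rather
than a real experiment. The toy corpus is invented and deliberately
constructed for illustration, multiword units are pre-joined with "_",
and the helper names (context_frames, jaccard) are hypothetical. It
collects the (left neighbour, right neighbour) frames each item occurs
in and compares the overlap between items.

```python
from collections import defaultdict

# Toy corpus, invented for illustration only; "_" joins multiword units.
SENTENCES = [
    "he painted his_house yesterday".split(),
    "he painted his_house last_Sunday".split(),
    "she sold the_car yesterday".split(),
    "she sold the_car last_Sunday".split(),
    "yesterday he painted his_house".split(),
    "last_Sunday she sold the_car".split(),
    "he sold his_house yesterday".split(),
]


def context_frames(sentences):
    """Map each token to the set of (left, right) neighbour pairs it occurs in."""
    frames = defaultdict(set)
    for sent in sentences:
        padded = ["<s>"] + sent + ["</s>"]
        for i in range(1, len(padded) - 1):
            frames[padded[i]].add((padded[i - 1], padded[i + 1]))
    return frames


def jaccard(a, b):
    """Jaccard overlap of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


frames = context_frames(SENTENCES)
# Overlap of "last_Sunday" with a temporal adverb vs. with an ordinary NP.
sim_adverb = jaccard(frames["last_Sunday"], frames["yesterday"])
sim_np = jaccard(frames["last_Sunday"], frames["his_house"])
print(f"last_Sunday ~ yesterday: {sim_adverb:.2f}")
print(f"last_Sunday ~ his_house: {sim_np:.2f}")
```

On this toy data "last_Sunday" shares frames with "yesterday" and none
with "his_house"; whether the same holds on a real corpus is exactly
what would need checking.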

Pete Whitelock
Data and Resources Development Manager
Reference Department
Academic Division
Oxford University Press


On 25-Feb-09, at 8:48 AM, Eric Atwell wrote:
> As others have commented, TreeTagger models for other languages are
> also derived from a PoS-tagged corpus, which suggests the only way to
> eradicate systematic errors is to "correct" the tagging in the
> training corpus

I'm a language teacher who dabbles in a variety of things including
linguistics and corpora. I'm an autodidact and don't have any great
expertise in any of these fields, so the following may be completely
obvious to everyone here, or it may be way off the mark:

It seems to me that an inconsistent grammatical description will lead to
inconsistent hand tagging, which, when used to train software, will lead
to inconsistent taggers. The more rigorous our grammar is, the better
our taggers will perform.
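
A minimal sketch of that chain, under loud assumptions: the tagger is a
bare most-frequent-tag baseline (not TreeTagger or any real system), the
annotations are invented, and the tag names "ADV" and "NP" are
placeholders. It only shows that a tagger cannot reproduce annotation
that disagrees with itself, whichever tag one prefers.

```python
from collections import Counter, defaultdict


def train_unigram_tagger(tagged_tokens):
    """Learn the most frequent tag for each word (a deliberately naive baseline)."""
    counts = defaultdict(Counter)
    for word, tag in tagged_tokens:
        counts[word][tag] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}


def agreement(model, tagged_tokens):
    """Share of tokens whose annotated tag matches the model's tag."""
    hits = sum(model.get(word) == tag for word, tag in tagged_tokens)
    return hits / len(tagged_tokens)


# Invented annotations: the same item tagged one way throughout ...
consistent = [("last_Sunday", "ADV")] * 10 + [("yesterday", "ADV")] * 10
# ... versus the same item tagged two ways by hypothetical annotators.
inconsistent = (
    [("last_Sunday", "ADV")] * 6
    + [("last_Sunday", "NP")] * 4
    + [("yesterday", "ADV")] * 10
)

consistent_fit = agreement(train_unigram_tagger(consistent), consistent)
inconsistent_fit = agreement(train_unigram_tagger(inconsistent), inconsistent)
print(consistent_fit, inconsistent_fit)
```

Even scored against its own training data, the baseline tops out below
100% on the inconsistent set, because the minority annotations can never
be matched.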

To take an English example, there was a recent paper in Language
Learning that referred to "last Sunday" as an adverb in the sentence "He
painted his house last Sunday." This confuses the function of the NP
(modifier) with the category (NP). If this is the kind of input the
software has for training, well, GIGO.

English has the benefit of an analysis like the Cambridge Grammar of the
English Language, which may not be a perfectly accurate description of
English, but seems to me head and shoulders above any other comprehensive
grammar published about English. I imagine that an English POS tagger
trained on CGEL-based tagsets would immediately outperform those based
on other grammars. I'm not familiar with comprehensive grammars of other
languages, but I'd guess they are plagued with inconsistencies.

For all languages, formal linguists, corpus linguists, corpus builders,
and software developers do need to be in constant interaction. An open
source project would seem a good way to facilitate this, but how do we
make sure there's the payback in terms of academic
credentials/publishing credit? (An interesting tangentially-related
discussion is here:
<http://worthwhile.typepad.com/worthwhile_canadian_initi/2009/02/economics-blogging-and-academia.html>)

Best,
Brett

<http://english-jack.blogspot.com>

-----------------------
Brett Reynolds
English Language Centre
Humber College Institute of Technology and Advanced Learning
Toronto, Ontario, Canada
brett.reynolds at humber.ca




_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora
