[Corpora-List] POS-tagger maintenance and improvement
Rayson, Paul
rayson at exchange.lancs.ac.uk
Wed Feb 25 21:25:08 UTC 2009
Hi Adam,
Presumably you are intending to be provocative there! It is difficult
for universities to get research funding to improve POS taggers for
English in particular. Why would funding agencies provide resources for
this? It is a solved problem, isn't it!! :-) The last time UCREL had
substantial funding that indirectly supported CLAWS was the BNC
enhancement project in 1996 (http://ucrel.lancs.ac.uk/projects.html#bnce).
In that project we also developed the template tagger as a more powerful
patching tool for CLAWS; see the papers listed at
http://ucrel.lancs.ac.uk/claws/
It is certainly feasible for users of CLAWS to add their own
supplementary rule files and dictionary files in order to improve the
tagging. Email me off list for info.
We are still interested in receiving feedback from users of CLAWS, and
such feedback has led to a number of recent papers, listed below. I think
the interesting stuff happens when you try to apply standard tools to
non-standard language, e.g. learner data, historical text, dialect
corpora:
Beal, J., Corrigan, K., Smith, N. and Rayson, P. (2007) Writing the
Vernacular: Transcribing and Tagging the Newcastle Electronic Corpus of
Tyneside English. Studies in Variation, Contacts and Change in English.
Volume 1. Research Unit for Variation, Contacts and Change in English
(VARIENG), University of Helsinki.
http://www.helsinki.fi/varieng/journal/volumes/01/beal_et_al/
Rayson, P., Archer, D., Baron, A., Culpeper, J. and Smith, N. (2007).
Tagging the Bard: Evaluating the accuracy of a modern POS tagger on
Early Modern English corpora. In Proceedings of Corpus Linguistics 2007,
July 27-30, University of Birmingham, UK.
http://ucrel.lancs.ac.uk/people/paul/publications/RaysonEtAl_CL2007.pdf
We've also got a forthcoming paper at the ICAME pre-conference workshop
on 'Errors and disfluencies in spoken corpora' with Joanna
Jendryczka-Wierszycka titled "Applying native language trained
annotation tools to non-native spoken corpora" relating to applying
CLAWS to spoken learner (LINDSEI) material.
For completeness, there are other papers that I refer to in my reply to
a previous question of yours about tagging spoken data on the Corpora
list:
http://www.uib.no/mailman/public/corpora/2005-July/001363.html
Regards,
Paul.
Dr. Paul Rayson
Director of UCREL
Computing Department, Infolab21, South Drive, Lancaster University,
Lancaster, LA1 4WA, UK.
Web: http://www.comp.lancs.ac.uk/computing/users/paul/
Tel: +44 1524 510357 Fax: +44 1524 510492
From: corpora-bounces at uib.no On Behalf Of Adam Kilgarriff
Sent: 25 February 2009 11:16
To: Corpora List
Cc: Sue Atkins; Valerie GRUNDY; Patrick Hanks
Subject: [Corpora-List] POS-tagger maintenance and improvement
All,
My lexicography colleagues and I use POS-tagged corpora all the time,
every day, and very frequently spot systematic errors. (This is for a
range of languages, but particularly English.) We would dearly like to
be in a dialogue with the developers of the POS-tagger and/or the
relevant language models so the tagger+model could be improved in
response to our feedback. (We have been using standard models rather
than training our own.) However, it seems that for the taggers and
language models we use (mainly TreeTagger, also CLAWS), and for other
market leaders, all of which seem to come from universities, the
developers have little motivation to keep improving their taggers, since
incremental improvements do not make for good research papers. So there
is nowhere for our feedback to go, nor any real prospect of these
taggers/models improving.
Am I too pessimistic? Are there ways of improving language models other
than developing bigger and better training corpora - not an exercise we
have the resources to invest in? Are there commercial taggers I should
be considering (as, in the commercial world, there is motivation for
incremental improvements and responding to customer feedback)?
Responses and ideas most welcome.
Adam Kilgarriff
--
================================================
Adam Kilgarriff
http://www.kilgarriff.co.uk
Lexical Computing Ltd http://www.sketchengine.co.uk
Lexicography MasterClass Ltd http://www.lexmasterclass.com
Universities of Leeds and Sussex adam at lexmasterclass.com
================================================