[Corpora-List] POS-tagger maintenance and improvement
amsler at cs.utexas.edu
Wed Feb 25 22:42:38 UTC 2009
It is worth noting that there are two tasks here. One is the
development of better POS taggers, but the other is the creation of
correctly tagged freely downloadable text corpora.
The development of better POS taggers is the sort of activity that
lends itself to periodic competitive evaluations in the manner of
SIGSEM or NIST-hosted evaluations (i.e., groups of individuals
perfecting their software and periodically trying it out against
specially created training and test corpora whose tagging has been
done and corrected so that it can serve as a gold standard for
evaluations).
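To make the gold-standard idea concrete, here is a minimal sketch of
how one tagger's output might be scored against a hand-corrected test
corpus. Per-token accuracy is an assumption on my part (an actual
evaluation could use other measures), and the function name, the tag
labels, and the sample sentence are all hypothetical.

```python
from typing import List, Tuple

# A tagged token is a (word, tag) pair, e.g. ("dog", "NN").
TaggedToken = Tuple[str, str]


def tagging_accuracy(gold: List[TaggedToken],
                     predicted: List[TaggedToken]) -> float:
    """Return the fraction of tokens whose predicted tag matches the gold tag.

    Assumes both sequences use the same tokenization, so token i in the
    tagger output corresponds to token i in the gold standard.
    """
    if not gold:
        raise ValueError("gold standard is empty")
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted differ in length")

    correct = 0
    for (gold_word, gold_tag), (pred_word, pred_tag) in zip(gold, predicted):
        if gold_word != pred_word:
            raise ValueError(f"token mismatch: {gold_word!r} vs {pred_word!r}")
        if gold_tag == pred_tag:
            correct += 1
    return correct / len(gold)


# Hypothetical four-token example; the tagger gets one tag wrong.
gold = [("The", "DT"), ("dog", "NN"), ("barks", "VBZ"), ("loudly", "RB")]
predicted = [("The", "DT"), ("dog", "NN"), ("barks", "NNS"), ("loudly", "RB")]

print(tagging_accuracy(gold, predicted))  # 3 of 4 tags agree
```

A score like this only compares taggers fairly if every participant
is measured against the same corrected test corpus and the same tag
set.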
However, the development of correctly tagged corpora is an activity
that could be performed en masse by a large community of web users, in
much the same manner that Wikipedia has been created by a community.
In fact, what suggests itself is that a version of Wikipedia (or
another body of copyright-free text, such as works drawn from
Gutenberg) with POS tagging (or, even more ambitiously, additional
tagging for grammatical structure and semantics) could be built and
grown to serve the community that needs reliably tagged text.
Both tasks share an underlying problem: what standard tags should be
used, and how should conflicting opinions about whether text is
correctly tagged be resolved? Still, the Wikipedia and Project
Gutenberg models show us how to enlist a mass of people to manually
correct the tags initially supplied by automated systems.
It would be nice if an existing standard corpus could be used, such as
the BNC, but I don't see that happening because of copyright issues.
However, there is nothing stopping us from creating alternate
correctly tagged texts based on Gutenberg works or Wikipedia articles
and offering them to those sites as alternative distribution texts.
Everyone keeps asking how the semantic web is going to come into
existence. Maybe this is how it starts?
R. Amsler
_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora