[Corpora-List] POS-tagger maintenance and improvement

amsler at cs.utexas.edu amsler at cs.utexas.edu
Wed Feb 25 22:42:38 UTC 2009


It is worth noting that there are two tasks here. One is the  
development of better POS taggers, but the other is the creation of  
correctly tagged freely downloadable text corpora.

The development of better POS taggers is the sort of activity that  
lends itself to periodic competitive evaluations of POS-taggers in the  
manner of SIGSEM or NIST-hosted evaluations (i.e., groups of
individuals perfecting their software and periodically trying it out  
against specifically created training and test corpora whose tagging  
is done and corrected so it can serve as a gold standard for  
evaluations).
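
As a rough illustration of what scoring against a gold standard could
look like, here is a minimal Python sketch. The function name, the
(token, tag) pair representation, and the sample Penn-Treebank-style
tags are assumptions made for the example, not part of any existing
evaluation campaign.

```python
def tagging_accuracy(gold, predicted):
    """Return the fraction of tokens whose predicted tag matches the gold tag.

    Both arguments are lists of (token, tag) pairs over the same tokens.
    """
    if not gold:
        raise ValueError("gold standard is empty")
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted differ in length")

    correct = 0
    for (gold_token, gold_tag), (pred_token, pred_tag) in zip(gold, predicted):
        # A tokenization mismatch would make the comparison meaningless.
        if gold_token != pred_token:
            raise ValueError(f"token mismatch: {gold_token!r} vs {pred_token!r}")
        if gold_tag == pred_tag:
            correct += 1
    return correct / len(gold)


# Hypothetical four-token sample; the tagger gets one tag wrong.
gold = [("The", "DT"), ("dog", "NN"), ("barks", "VBZ"), ("loudly", "RB")]
predicted = [("The", "DT"), ("dog", "NN"), ("barks", "NNS"), ("loudly", "RB")]
accuracy = tagging_accuracy(gold, predicted)
print(accuracy)
```

A real evaluation would also have to settle how to handle differing
tokenizations and tagsets, which this sketch simply rejects.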

However, the development of correctly tagged corpora is an activity  
that could be performed en masse by a large community of web users, in  
much the manner that Wikipedia has been created by a community. In  
fact, what seems to suggest itself is that perhaps a version of
Wikipedia (or another body of copyright-free text such as works drawn
from Gutenberg) with POS-tagging (or even more ambitiously,
additional tagging for grammatical structure and semantics) could be  
built and grown to serve the community that needs reliably tagged text.

Both tasks share underlying problems: what standard tags to use, and
how to resolve conflicting opinions about whether text is correctly
tagged. But the Wikipedia and Project Gutenberg models show us how to
enlist a mass of people to manually correct the tags initially
supplied by automated systems.
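
One simple way conflicting opinions might be reconciled is a majority
vote with an agreement threshold, sketched below in Python. The
function name, the 0.6 threshold, and the idea of returning None to
flag a token for further review are all assumptions for illustration;
a community project could well choose a different policy.

```python
from collections import Counter


def resolve_tag(votes, min_agreement=0.6):
    """Pick the tag most annotators chose, or None if agreement is too low.

    votes is a list of tags proposed for one token by different people.
    None means the token should be flagged for further review.
    """
    if not votes:
        return None
    top_tag, top_count = Counter(votes).most_common(1)[0]
    if top_count / len(votes) >= min_agreement:
        return top_tag
    return None


# Hypothetical votes for a single token from three contributors.
print(resolve_tag(["NN", "NN", "VB"]))  # two of three agree
print(resolve_tag(["NN", "VB"]))        # evenly split, so unresolved
```

With any threshold above 0.5, an even split can never be accepted, so
ties always fall through to manual review.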

It would be nice if an existing standard corpus could be used, such as  
the BNC, but I don't see that happening because of copyright issues.  
However, there is nothing stopping us from creating alternate  
correctly tagged texts based on Gutenberg works or Wikipedia articles  
and offering them to those sites as alternative distribution texts.

Everyone keeps asking how the semantic web is going to come into  
existence. Maybe this is how it starts?

R. Amsler
