[Corpora-List] POS tagging via relational databases (follow-up)
Mark Davies
Mark_Davies at byu.edu
Thu Sep 25 11:30:43 UTC 2003
Thanks to those who responded to my earlier message -- most via private
communication. Here's a bit of an update.
I ran some sample updates on a 1 million word extract of the BNC that I
have in database format in SQL Server. First I "corrupted" the POS tags
for 20,000-30,000 rows by running a query that would, for example,
change "AJ0" to "AJ0-xxx" wherever the preceding row was tagged "AT0"
and the word in the second (AJ0) row started with "s-" or "t-". Then
I'd run the "correction" update that
would set "AJ0-xxx" to "AJ0" after "AT0" (modeling the resolution of
ambiguity). I ran about twenty such "correction" UPDATE queries in
sequence and noted the total elapsed time.
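For concreteness, here is a minimal sketch of one such corrupt/correct pair. It uses SQLite through Python's sqlite3 module rather than SQL Server, and the table and column names (corpus, id, word, pos) are assumptions for illustration, not the actual schema. It also assumes that consecutive tokens have consecutive id values, so that "the preceding row" is simply id - 1.

```python
import sqlite3
import time

# Toy data; the real extract has about 1,000,000 rows.
ROWS = [
    (1, "the", "AT0"), (2, "small", "AJ0"), (3, "dog", "NN1"),
    (4, "a", "AT0"), (5, "big", "AJ0"), (6, "tall", "AJ0"),
    (7, "the", "AT0"), (8, "tiny", "AJ0"),
]

def make_corpus(rows):
    """Build an in-memory table with one row per token (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE corpus (id INTEGER PRIMARY KEY, word TEXT, pos TEXT)")
    conn.executemany("INSERT INTO corpus VALUES (?, ?, ?)", rows)
    return conn

# Context test shared by both queries: the preceding token is tagged AT0.
PREV_IS_AT0 = """EXISTS (SELECT 1 FROM corpus AS prev
                 WHERE prev.id = corpus.id - 1 AND prev.pos = 'AT0')"""

# "Corrupt" AJ0 after AT0 when the word starts with s or t.
CORRUPT = f"""UPDATE corpus SET pos = 'AJ0-xxx'
WHERE pos = 'AJ0' AND (word LIKE 's%' OR word LIKE 't%') AND {PREV_IS_AT0}"""

# "Correct" it again: AJ0-xxx becomes AJ0 after AT0.
CORRECT = f"""UPDATE corpus SET pos = 'AJ0'
WHERE pos = 'AJ0-xxx' AND {PREV_IS_AT0}"""

def run_update(conn, sql):
    """Run one UPDATE and return the number of rows it changed."""
    return conn.execute(sql).rowcount

def time_updates(conn, statements):
    """Run the UPDATEs in sequence and return the total elapsed seconds."""
    start = time.perf_counter()
    for sql in statements:
        conn.execute(sql)
    return time.perf_counter() - start

def tags(conn):
    """Return the POS column in corpus order."""
    return [pos for (pos,) in conn.execute("SELECT pos FROM corpus ORDER BY id")]
```

Calling run_update(conn, CORRUPT) and then run_update(conn, CORRECT) on the toy data changes the same two rows each time and leaves the tags as they started.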
Each update of 20,000-30,000 rows takes about 0.4 seconds, meaning that
you could run about thirty of them in 10-12 seconds. This is after the
initial UPDATE of the POS column for all rows in the database from the
lexicon -- which takes about 8-10 seconds. Also, any updates on rows
with specific lexical items (even relatively high frequency items) are
essentially instantaneous.
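The initial lexicon pass and a lexical-item update might look roughly like the sketch below. Again this is SQLite via Python's sqlite3, not SQL Server, and the lexicon table, the index on the word column, and the function names are all assumptions made for illustration. An index of this kind is one plausible reason why updates keyed on a specific word would be fast.

```python
import sqlite3

def make_db(tokens, lexicon):
    """Create hypothetical corpus and lexicon tables in memory."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE corpus (id INTEGER PRIMARY KEY, word TEXT, pos TEXT)")
    conn.execute("CREATE TABLE lexicon (word TEXT PRIMARY KEY, pos TEXT)")
    # Assumed index: lets updates on a single lexical item avoid a full scan.
    conn.execute("CREATE INDEX idx_corpus_word ON corpus (word)")
    conn.executemany("INSERT INTO corpus (word) VALUES (?)", [(w,) for w in tokens])
    conn.executemany("INSERT INTO lexicon VALUES (?, ?)", lexicon)
    return conn

def tag_from_lexicon(conn):
    """Copy each word's (possibly ambiguous) tag from the lexicon; return rows changed."""
    cur = conn.execute("""
        UPDATE corpus
        SET pos = (SELECT lexicon.pos FROM lexicon WHERE lexicon.word = corpus.word)
        WHERE EXISTS (SELECT 1 FROM lexicon WHERE lexicon.word = corpus.word)
    """)
    return cur.rowcount

def retag_word(conn, word, pos):
    """Update every row for one lexical item; return rows changed."""
    return conn.execute("UPDATE corpus SET pos = ? WHERE word = ?", (pos, word)).rowcount

def tags(conn):
    """Return the POS column in corpus order."""
    return [pos for (pos,) in conn.execute("SELECT pos FROM corpus ORDER BY id")]
```

Words missing from the lexicon keep a NULL tag in this sketch; how unknown words are really handled is not something the timings above speak to.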
Anyway, all of this suggests that it would take about 20 seconds to tag
a 1,000,000 word corpus with about thirty rewrite rules, and perhaps 30
seconds for sixty or so rewrite rules. At this rate, and assuming the
time scales linearly, one could tag the entire 100,000,000 word BNC in
a little over half an hour (about 33 minutes at 20 seconds per million
words). This seems fairly acceptable to me, although some have suggested
that this is still rather slow compared with state-of-the-art taggers.
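As a back-of-the-envelope check on the whole-BNC extrapolation, the sketch below scales the per-million-word timings linearly to 100,000,000 words. Linear scaling is an assumption here, not something measured above, and the function name is only illustrative. On that assumption the thirty-rule figure comes to about 33 minutes and the sixty-rule figure to 50 minutes.

```python
def estimate_minutes(seconds_per_million, corpus_words):
    """Scale a per-million-word timing linearly to a corpus of the given size."""
    return seconds_per_million * (corpus_words / 1_000_000) / 60

BNC_WORDS = 100_000_000

# About thirty rewrite rules: roughly 20 seconds per million words.
thirty_rules = estimate_minutes(20, BNC_WORDS)

# Sixty or so rewrite rules: roughly 30 seconds per million words.
sixty_rules = estimate_minutes(30, BNC_WORDS)

print(f"thirty rules: {thirty_rules:.1f} min, sixty rules: {sixty_rules:.1f} min")
```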
Mark Davies
P.S. One or two others questioned the complexity of the SQL
rewrite/UPDATE rules, but these can be easily derived via simple scripts
from more standard rules, such as [NN2-VVZ > NN2 / AT0 __]. Also, any
ordering problems could -- it seems -- be accounted for as easily with
SQL as with the rewrite rules in the Brill tagger.
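A script of that kind could be as small as the sketch below, which turns a rule written as SOURCE > TARGET / LEFT __ RIGHT into a parameterised UPDATE and applies a list of rules in order. The rule grammar handled here (one optional tag on each side of the gap), the function names, and the corpus(id, word, pos) schema with consecutive id values are all assumptions for illustration, and the SQL is run on SQLite via Python rather than SQL Server.

```python
import re
import sqlite3

# SOURCE > TARGET / LEFT __ RIGHT, where LEFT and RIGHT are optional tags.
RULE_RE = re.compile(r"^\s*(\S+)\s*>\s*(\S+)\s*/\s*(\S*?)\s*__\s*(\S*)\s*$")

def rule_to_sql(rule):
    """Translate one rewrite rule into (sql, params) for a parameterised UPDATE."""
    match = RULE_RE.match(rule.strip().strip("[]"))
    if match is None:
        raise ValueError(f"unrecognised rule: {rule!r}")
    source, target, left, right = match.groups()
    sql = "UPDATE corpus SET pos = ? WHERE pos = ?"
    params = [target, source]
    if left:  # tag required on the preceding token
        sql += (" AND EXISTS (SELECT 1 FROM corpus AS prev"
                " WHERE prev.id = corpus.id - 1 AND prev.pos = ?)")
        params.append(left)
    if right:  # tag required on the following token
        sql += (" AND EXISTS (SELECT 1 FROM corpus AS nxt"
                " WHERE nxt.id = corpus.id + 1 AND nxt.pos = ?)")
        params.append(right)
    return sql, tuple(params)

def apply_rules(conn, rules):
    """Apply the rules in list order; return the rows changed by each one."""
    return [conn.execute(*rule_to_sql(rule)).rowcount for rule in rules]
```

Because apply_rules runs the UPDATEs strictly in list order, a later rule sees the tags left by earlier ones, so rule ordering is controlled simply by the order of the list.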
=================================================
Mark Davies
Assoc. Prof., Linguistics
Brigham Young University
(phone) 801-422-9168 / (fax) 801-422-0906
http://davies-linguistics.byu.edu
** Corpus design and use // Web-database scripting **
** Historical linguistics // Functional-typological grammar **
** Spanish and Portuguese historical and dialectal syntax **
=================================================