[Corpora-List] POS tagging via relational databases
Mark Davies
Mark_Davies at byu.edu
Wed Sep 24 19:17:55 UTC 2003
Is anyone aware of projects in which relational databases have been used
to do POS tagging? Rather than passing through a linear text token by
token, the tagging would all be done on adjacent rows in the database,
using subqueries or JOINs. For example, you would have a table with N
rows, where N is the number of words in the corpus. Each row would have
the following structure (lemma would probably be included as well):
ID      word    pos
-----   -----   -----
. . .
516     the     AT0
517     play    NN1
518     by      PREP
519     Ibsen   NP0
. . .
1450    wants   VVZ
1451    to      PRP
1452    play    VVI
. . .
To disambiguate words like <play, strike, hit> to NOUN after a DET, the
query would look something like:
update t2
set t2.pos = 'NN1'
from tagger as t1, tagger as t2
where t2.word = 'play' and t1.pos = 'AT0'
and t2.ID = t1.ID + 1
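
For anyone who wants to try the idea without SQL Server, here is a minimal sketch using Python's built-in sqlite3 module. It is not the query above verbatim: it swaps the UPDATE ... FROM self-join for a correlated EXISTS subquery, which older SQLite versions also accept. The '?' placeholder tag and the helper names build_corpus and noun_after_article are assumptions made for illustration.

```python
import sqlite3

# Sample rows from the post; '?' is a placeholder (an assumption here)
# marking tokens whose tag has not been decided yet.
SAMPLE = [
    (516, "the", "AT0"),
    (517, "play", "?"),
    (518, "by", "PREP"),
    (519, "Ibsen", "NP0"),
    (1450, "wants", "VVZ"),
    (1451, "to", "PRP"),
    (1452, "play", "?"),
]


def build_corpus(rows):
    """Create an in-memory 'tagger' table with one row per token."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE tagger (ID INTEGER PRIMARY KEY, word TEXT, pos TEXT)"
    )
    conn.executemany("INSERT INTO tagger (ID, word, pos) VALUES (?, ?, ?)", rows)
    return conn


def noun_after_article(conn, word):
    """Retag `word` as NN1 wherever the row just before it is tagged AT0.

    Returns the number of rows that were updated.
    """
    cur = conn.execute(
        """
        UPDATE tagger
           SET pos = 'NN1'
         WHERE word = ?
           AND EXISTS (SELECT 1
                         FROM tagger AS prev
                        WHERE prev.ID = tagger.ID - 1
                          AND prev.pos = 'AT0')
        """,
        (word,),
    )
    return cur.rowcount
```

With the sample rows, noun_after_article(build_corpus(SAMPLE), "play") should retag only row 517, since row 1452 follows <to> rather than an article.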
Of course, rather than dealing with specific word forms (e.g. <play>
above), you could use a sub-query to apply it to hundreds or thousands
of items from another table (e.g. the lexicon). Likewise, you could
apply it to all words that have a particular POS, as in the following,
where all doubly-tagged <NN1-VVZ> go to <NN1> after <AT0>:
update t2
set t2.pos = 'NN1'
from tagger as t1, tagger as t2
where t2.pos = 'NN1-VVZ' and t1.pos = 'AT0'
and t2.ID = t1.ID + 1
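
The lexicon-driven variant mentioned above might look roughly like the following sketch, again in Python with sqlite3 rather than SQL Server. The layout of the lexicon table (one row per word/tag pair), the '?' placeholder for undecided tokens, and the sample rows are all assumptions for illustration; the post does not specify them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE tagger  (ID INTEGER PRIMARY KEY, word TEXT, pos TEXT);
    -- Hypothetical lexicon layout: one row per (word, possible tag) pair.
    CREATE TABLE lexicon (word TEXT, pos TEXT);
    """
)
conn.executemany(
    "INSERT INTO lexicon (word, pos) VALUES (?, ?)",
    [
        ("play", "NN1"), ("play", "VVI"),
        ("strike", "NN1"), ("strike", "VVI"),
        ("hit", "NN1"), ("hit", "VVI"),
        ("the", "AT0"), ("wants", "VVZ"),
    ],
)
# '?' marks tokens that still need a tag (a made-up placeholder).
conn.executemany(
    "INSERT INTO tagger (ID, word, pos) VALUES (?, ?, ?)",
    [
        (1, "the", "AT0"), (2, "strike", "?"),
        (3, "wants", "VVZ"), (4, "hit", "?"),
        (5, "the", "AT0"), (6, "wants", "?"),
    ],
)


def noun_after_article_from_lexicon(conn):
    """Tag as NN1 every undecided token that the lexicon lists as a
    possible NN1 and that directly follows a row tagged AT0."""
    cur = conn.execute(
        """
        UPDATE tagger
           SET pos = 'NN1'
         WHERE tagger.pos = '?'
           AND tagger.word IN (SELECT lexicon.word
                                 FROM lexicon
                                WHERE lexicon.pos = 'NN1')
           AND EXISTS (SELECT 1
                         FROM tagger AS prev
                        WHERE prev.ID = tagger.ID - 1
                          AND prev.pos = 'AT0')
        """
    )
    return cur.rowcount
```

One statement covers every word the lexicon allows as a noun, so <strike> after <the> is retagged, while <hit> after <wants> and the undecided <wants> (not a possible NN1) are left alone.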
Anyway, assuming a robust relational database (e.g. SQL Server or
Oracle), it should be possible to tag a decent-sized corpus (e.g. one
million words) in less than an hour -- perhaps just a few minutes -- by
doing the following:
1) inserting POS and lemma information from the lexicon into the corpus
(via simple UPDATE and JOIN commands), and then
2) disambiguating, by applying hundreds of rules (like those described
above) to the tagged corpus
You could also:
3) use morphological rules to guess tags for unknown forms. For example,
if <roller-blading> is not found in the lexicon, you would guess its tag
from the <-ING>. More powerfully, you could tag forms that are not in
the lexicon by using subqueries. For example, assuming that <mopeds> is
not in the lexicon, you could run a sub-query to look for the base form
<moped>, and if it is found as an <NN1>, then you assign <NN2> to
<mopeds>. Again, this query could be run on many words in the corpus all
at one time -- via a simple UPDATE command.
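
The two morphological guesses could be sketched as follows (Python with sqlite3, not a tested tagger). Several things here are assumptions: untagged tokens are stored with a NULL pos, the tag guessed from <-ING> is VVG, the plural is formed by a bare final -s, and the lexicon has simple word/pos columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE tagger  (ID INTEGER PRIMARY KEY, word TEXT, pos TEXT);
    CREATE TABLE lexicon (word TEXT, pos TEXT);  -- hypothetical layout
    """
)
conn.execute("INSERT INTO lexicon (word, pos) VALUES ('moped', 'NN1')")
# NULL pos = not found in the lexicon during lookup (an assumption).
conn.executemany(
    "INSERT INTO tagger (ID, word, pos) VALUES (?, ?, ?)",
    [(1, "roller-blading", None), (2, "mopeds", None), (3, "xyzs", None)],
)


def guess_ing(conn):
    """Guess VVG (assumed tag) for untagged, out-of-lexicon words in -ing."""
    cur = conn.execute(
        """
        UPDATE tagger
           SET pos = 'VVG'
         WHERE tagger.pos IS NULL
           AND tagger.word LIKE '%ing'
           AND NOT EXISTS (SELECT 1 FROM lexicon
                            WHERE lexicon.word = tagger.word)
        """
    )
    return cur.rowcount


def guess_plural(conn):
    """Assign NN2 to untagged words in -s whose base form (the word minus
    its final 's') is listed in the lexicon as NN1."""
    cur = conn.execute(
        """
        UPDATE tagger
           SET pos = 'NN2'
         WHERE tagger.pos IS NULL
           AND tagger.word LIKE '%s'
           AND NOT EXISTS (SELECT 1 FROM lexicon
                            WHERE lexicon.word = tagger.word)
           AND EXISTS (SELECT 1 FROM lexicon
                        WHERE lexicon.word =
                              substr(tagger.word, 1, length(tagger.word) - 1)
                          AND lexicon.pos = 'NN1')
        """
    )
    return cur.rowcount
```

Each function is a single UPDATE, so it applies to every matching row in the corpus at once. A form like the made-up <xyzs> stays untagged because no base form <xyz> exists in the lexicon.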
In essence, then, the approach to tagging is somewhat like a Brill
tagger, but with all of the disambiguation done within the relational
database itself.
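
Put together, the two core passes (lexicon lookup, then an ordered list of contextual rules) could be sketched like this in Python with sqlite3. The rule format, the ambiguity tag 'NN1-VVI', the second rule, and the one-row-per-word lexicon are illustrative assumptions; only the first rule's shape comes from the examples above.

```python
import sqlite3

# Hypothetical rule format: (ambiguous tag, tag of previous row, new tag).
RULES = [
    ("NN1-VVI", "AT0", "NN1"),  # noun after an article
    ("NN1-VVI", "PRP", "VVI"),  # made-up rule: verb after <to>
]

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE tagger  (ID INTEGER PRIMARY KEY, word TEXT, pos TEXT);
    -- Assumed lexicon: one row per word, pos may be an ambiguity tag.
    CREATE TABLE lexicon (word TEXT PRIMARY KEY, pos TEXT);
    """
)
conn.executemany(
    "INSERT INTO lexicon (word, pos) VALUES (?, ?)",
    [("the", "AT0"), ("play", "NN1-VVI"), ("by", "PREP"),
     ("Ibsen", "NP0"), ("wants", "VVZ"), ("to", "PRP")],
)
conn.executemany(
    "INSERT INTO tagger (ID, word) VALUES (?, ?)",
    [(516, "the"), (517, "play"), (518, "by"), (519, "Ibsen"),
     (1450, "wants"), (1451, "to"), (1452, "play")],
)


def tag_corpus(conn, rules):
    """Run both passes; return how many rows the rules changed."""
    # Pass 1: copy each word's (possibly ambiguous) tag from the lexicon.
    conn.execute(
        """
        UPDATE tagger
           SET pos = (SELECT lexicon.pos FROM lexicon
                       WHERE lexicon.word = tagger.word)
         WHERE EXISTS (SELECT 1 FROM lexicon
                        WHERE lexicon.word = tagger.word)
        """
    )
    # Pass 2: apply the contextual rules in order, one UPDATE per rule.
    changed = 0
    for ambiguous, prev_tag, resolved in rules:
        cur = conn.execute(
            """
            UPDATE tagger
               SET pos = ?
             WHERE tagger.pos = ?
               AND EXISTS (SELECT 1 FROM tagger AS prev
                            WHERE prev.ID = tagger.ID - 1
                              AND prev.pos = ?)
            """,
            (resolved, ambiguous, prev_tag),
        )
        changed += cur.rowcount
    conn.commit()
    return changed
```

Keeping the rules as plain data means hundreds of them can be stored in a table or file and applied in a loop; because later rules see the output of earlier ones, their order matters, much as it does in a Brill-style tagger.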
Anyway, has anyone seen such an approach? I'd be happy to share a
summary of your comments, if there is sufficient response.
Thanks in advance,
Mark Davies
=================================================
Mark Davies
Assoc. Prof., Linguistics
Brigham Young University
(phone) 801-422-9168 / (fax) 801-422-0906
http://davies-linguistics.byu.edu
** Corpus design and use // Web-database scripting **
** Historical linguistics // Functional-typological grammar **
** Spanish and Portuguese historical and dialectal syntax **
=================================================