[Corpora-List] POS tagging via relational databases

Wed Sep 24 19:17:55 UTC 2003

Is anyone aware of projects in which relational databases have been used
to do POS tagging?  Rather than passing through a linear text token by
token, the tagging would all be done by comparing adjacent rows in the
database, using subqueries or JOINs.  For example, you would have a table
with N rows, where N = the number of words in the corpus.  Each row would
have the following structure (lemma would probably be here as well):

	ID	word	pos 
	-----	-----	-----
	. . .
	516	the	AT0
	517	play	NN1
	518	by	PRP
	519	Ibsen	NP0
	. . . 
	1450	wants	VVZ
	1451	to	TO0
	1452	play	VVI
	. . . 
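
As a rough sketch of that layout, assuming SQLite through Python's
sqlite3 module rather than SQL Server or Oracle: the table name tagger
matches the queries in this message, while the column types and the
helper name build_tagger are assumptions made for illustration.

```python
import sqlite3

def build_tagger(rows):
    """Create an in-memory tagger table with one row per corpus token."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE tagger ("
        " ID INTEGER PRIMARY KEY,"  # token position in the corpus
        " word TEXT NOT NULL,"
        " pos TEXT)"                # stays NULL until a tag is assigned
    )
    conn.executemany("INSERT INTO tagger (ID, word, pos) VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn

# Three of the sample rows shown above.
conn = build_tagger([(516, "the", "AT0"), (517, "play", "NN1"), (519, "Ibsen", "NP0")])

# Adjacent tokens are one self-join away: pair each row with the row after it.
pairs = conn.execute(
    "SELECT t1.word, t2.word FROM tagger AS t1"
    " JOIN tagger AS t2 ON t2.ID = t1.ID + 1"
    " ORDER BY t1.ID"
).fetchall()
print(pairs)
```

Because adjacency is expressed as ID + 1, the IDs have to be gap-free
within a text for the self-join to find every neighbouring pair.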

To resolve ambiguous words like <play, strike, hit> to NOUN when they
follow a DET (here, <play> to NN1 after AT0), the query would look
something like:

	-- SQL Server-style UPDATE ... FROM
	-- t1 = preceding token, t2 = token being retagged
	update t2
	set t2.pos = 'NN1'
	from tagger as t1
		join tagger as t2 on t2.ID = t1.ID + 1
	where t2.word = 'play' and t1.pos = 'AT0'
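
The same rule can be tried out in a small runnable form.  This is only a
sketch, assuming SQLite via Python's sqlite3 module; since SQLite does
not accept the aliased UPDATE ... FROM form used above, the self-join is
rewritten as a correlated EXISTS subquery.  The toy rows and the NULL
placeholder for an undecided tag are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tagger (ID INTEGER PRIMARY KEY, word TEXT, pos TEXT)")
conn.executemany(
    "INSERT INTO tagger VALUES (?, ?, ?)",
    [
        (516, "the", "AT0"),
        (517, "play", None),     # undecided; directly follows an article
        (1450, "wants", "VVZ"),
        (1452, "play", "VVI"),   # no AT0 in the row before it
    ],
)

# Retag <play> as NN1 only where the immediately preceding row is AT0.
RULE = """
UPDATE tagger
SET pos = 'NN1'
WHERE word = 'play'
  AND EXISTS (SELECT 1 FROM tagger AS prev
              WHERE prev.ID = tagger.ID - 1 AND prev.pos = 'AT0')
"""
changed = conn.execute(RULE).rowcount
conn.commit()

result = conn.execute("SELECT ID, pos FROM tagger WHERE word = 'play' ORDER BY ID").fetchall()
print(changed, result)
```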

Of course, rather than naming specific word forms (e.g. <play> above),
you could use a subquery to apply the rule to hundreds or thousands of
items from another table (e.g. the lexicon).  Likewise, you could apply
it to all words that have a particular POS, as in the following, where
all doubly-tagged <NN1-VVZ> go to <NN1> after <AT0>:

	-- same pattern, but keyed on the ambiguity tag rather than the word form
	update t2
	set t2.pos = 'NN1'
	from tagger as t1
		join tagger as t2 on t2.ID = t1.ID + 1
	where t2.pos = 'NN1-VVZ' and t1.pos = 'AT0'
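
The lexicon-driven variant mentioned above might look like the following
sketch, again assuming SQLite via Python's sqlite3 module.  The lexicon
table layout (one row per word/tag pair), the toy rows, and the use of
NULL for an undecided tag are all assumptions, not a description of any
existing system.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tagger (ID INTEGER PRIMARY KEY, word TEXT, pos TEXT)")
conn.execute("CREATE TABLE lexicon (word TEXT, pos TEXT)")  # hypothetical layout
conn.executemany(
    "INSERT INTO lexicon VALUES (?, ?)",
    [("play", "NN1"), ("play", "VVI"), ("strike", "NN1"), ("strike", "VVI"),
     ("hit", "NN1"), ("hit", "VVI"), ("wants", "VVZ")],
)
conn.executemany(
    "INSERT INTO tagger VALUES (?, ?, ?)",
    [(1, "the", "AT0"), (2, "strike", None),   # possible noun after an article
     (3, "Ibsen", "NP0"), (4, "hit", None),    # possible noun, but not after an article
     (5, "the", "AT0"), (6, "wants", None)],   # after an article, but no NN1 reading
)

# One UPDATE covers every undecided word that the lexicon lists as a
# possible NN1 and that directly follows an AT0.
RULE = """
UPDATE tagger
SET pos = 'NN1'
WHERE tagger.pos IS NULL
  AND tagger.word IN (SELECT lexicon.word FROM lexicon WHERE lexicon.pos = 'NN1')
  AND EXISTS (SELECT 1 FROM tagger AS prev
              WHERE prev.ID = tagger.ID - 1 AND prev.pos = 'AT0')
"""
changed = conn.execute(RULE).rowcount
conn.commit()

result = conn.execute("SELECT word, pos FROM tagger WHERE ID IN (2, 4, 6) ORDER BY ID").fetchall()
print(changed, result)
```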

Anyway, assuming a robust relational database (e.g. SQL Server or
Oracle), it should be possible to tag a decent-sized corpus (e.g. one
million words) in less than an hour -- perhaps just a few minutes -- by
doing the following:

1) inserting POS and lemma information from the lexicon into the corpus
(via simple UPDATE and JOIN commands) and then
2) disambiguating, by applying hundreds of rules (like those described
above) to the tagged corpus
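
Steps 1 and 2 could be wired together roughly as follows.  This is a
sketch under assumptions: SQLite via Python's sqlite3 module, a lexicon
table with one (possibly ambiguous) tag per word, and a rule list held
in Python; the names LOOKUP, CONTEXT_RULE, RULES and tag_corpus are
invented for the example.  The NN1-VVZ ambiguity tag is the one used in
the rule above.

```python
import sqlite3

# Step 1: copy each word's lexicon tag into the corpus (NULL if not found).
LOOKUP = """
UPDATE tagger
SET pos = (SELECT lexicon.pos FROM lexicon WHERE lexicon.word = tagger.word)
WHERE pos IS NULL
"""

# Step 2: one parameterised contextual rule, run once per entry in RULES.
CONTEXT_RULE = """
UPDATE tagger
SET pos = :new
WHERE pos = :old
  AND EXISTS (SELECT 1 FROM tagger AS prev
              WHERE prev.ID = tagger.ID - 1 AND prev.pos = :prev)
"""

# A real rule set would hold hundreds of these; order matters, because
# later rules see the tags written by earlier ones.
RULES = [
    {"old": "NN1-VVZ", "prev": "AT0", "new": "NN1"},
]

def tag_corpus(conn, rules):
    """Run the lexicon lookup, then each rule in order; return rows changed per rule."""
    conn.execute(LOOKUP)
    counts = [conn.execute(CONTEXT_RULE, rule).rowcount for rule in rules]
    conn.commit()
    return counts

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lexicon (word TEXT PRIMARY KEY, pos TEXT)")
conn.execute("CREATE TABLE tagger (ID INTEGER PRIMARY KEY, word TEXT, pos TEXT)")
conn.executemany("INSERT INTO lexicon VALUES (?, ?)",
                 [("the", "AT0"), ("play", "NN1-VVZ"), ("Ibsen", "NP0")])
conn.executemany("INSERT INTO tagger (ID, word) VALUES (?, ?)",
                 [(516, "the"), (517, "play"), (519, "Ibsen"), (520, "zither")])

counts = tag_corpus(conn, RULES)
tags = conn.execute("SELECT word, pos FROM tagger ORDER BY ID").fetchall()
print(counts, tags)
```

Words missing from the lexicon (<zither> here) simply stay untagged
after step 1.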

You could also:

3) use morphological rules to disambiguate forms.  For example, if
<roller-blading> is not found in the lexicon, you would guess its tag
from the <-ING> ending.  More powerfully, you could tag forms that are
not in the lexicon by using subqueries.  For example, assuming that
<mopeds> is not in the lexicon, you could run a subquery to look for
the base form <moped>, and if it is found as an <NN1>, then you assign
<NN2> to <mopeds>.  Again, this query could be run on many words in the
corpus all at one time -- via a simple UPDATE command.
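
Both guesses can be sketched as plain UPDATE statements, again assuming
SQLite via Python's sqlite3 module.  The lexicon layout, the toy rows
(including the made-up word <glorps>), and the choice of VVG as the tag
for unknown <-ING> forms are assumptions; the plural rule only handles a
bare <-s> ending.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lexicon (word TEXT PRIMARY KEY, pos TEXT)")
conn.execute("CREATE TABLE tagger (ID INTEGER PRIMARY KEY, word TEXT, pos TEXT)")
conn.execute("INSERT INTO lexicon VALUES ('moped', 'NN1')")
conn.executemany(
    "INSERT INTO tagger VALUES (?, ?, ?)",
    [(1, "mopeds", None), (2, "roller-blading", None), (3, "glorps", None)],
)

# Untagged word ending in -s whose base form (word minus the final s)
# is in the lexicon as NN1: assign NN2.
PLURAL_RULE = """
UPDATE tagger
SET pos = 'NN2'
WHERE tagger.pos IS NULL
  AND tagger.word LIKE '%s'
  AND EXISTS (SELECT 1 FROM lexicon
              WHERE lexicon.word = substr(tagger.word, 1, length(tagger.word) - 1)
                AND lexicon.pos = 'NN1')
"""

# Untagged word ending in -ing: guess VVG (assumed tag for this sketch).
ING_RULE = """
UPDATE tagger
SET pos = 'VVG'
WHERE tagger.pos IS NULL AND tagger.word LIKE '%ing'
"""

plural_changed = conn.execute(PLURAL_RULE).rowcount
ing_changed = conn.execute(ING_RULE).rowcount
conn.commit()

guessed = conn.execute("SELECT word, pos FROM tagger ORDER BY ID").fetchall()
print(plural_changed, ing_changed, guessed)
```

<glorps> stays untagged because its base form is not in the lexicon.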

In essence, then, the approach to tagging is kind of like a Brill
tagger, but with all of the disambiguation done within the relational
database itself.

Anyway, has anyone seen such an approach?  I'd be happy to share a
summary of your comments, if there is sufficient response.

Thanks in advance,

Mark Davies

=================================================
Mark Davies
Assoc. Prof., Linguistics
Brigham Young University
(phone) 801-422-9168 / (fax) 801-422-0906
http://davies-linguistics.byu.edu

** Corpus design and use // Web-database scripting **
** Historical linguistics // Functional-typological grammar **
** Spanish and Portuguese historical and dialectal syntax **
=================================================