[Corpora-List] Problems with the architecture for a large corpus (for SQL or database gurus)

Mark Davies mdavies at ilstu.edu
Tue Oct 22 05:44:27 UTC 2002


RE: [Corpora-List] Legal aspects of corpora compilingI’m experimenting with
a change to the architecture of a 100 million word corpus of Spanish that I’
ve created (www.corpusdelespanol.org).  With my current solution, things
work pretty well.  Nevertheless, for some reasons listed below, I'm
considering a change to another database architecture, but the queries with
the new approach are too slow.  Here’s an overview:

================
CURRENT APPROACH
================

I use large databases of n-grams that contain all of the distinct
1,2,3,4-grams in the entire 100 million word corpus, along with their
frequency in several historical periods and registers of Modern Spanish.
Using this approach, a typical query looks like the following:

select top 300 * from x3 where
w1 in ('le', 'les')
and
w2 in (select w1 from x_L where x1 = 'hacer')
and
w3 in (select w1 from x_c where x1 = 'v_inf')
order by x19 DESC

where w1,w2, and w3 are the three word slots in x3 (a table containing all
of the 40+ million distinct 3-grams in the corpus), and x_L and x_c are two
separate tables containing the lemma or part of speech (column x1) for about
1,000,000 distinct forms (w1).  ( Note: in order to run this query, the user
submits [ le/les  hacer.*  *.v_inf ] (without the brackets)).

The preceding example, then, selects all of the distinct three word strings
composed of [le or les] + [a form of hacer "to do"] + [an infinitive], e.g.
le hice decir, les hagan comprar, etc., and then sorts them by their
frequency in the 1900s (column x19).

This approach has a number of advantages, including the ability to search
directly by frequency in a given sub-corpus (i.e. all of the matching
strings that occur more than three times in the 1900s, but no times in the
1700s or 1800s), and it is also very fast -- about one or two seconds for
rather complicated searches like this.

The downside, and the reason I’m considering changing the architecture of
the database, is that every time I add or delete a text from the corpus, I
have to re-calculate all of the 1,2,3,4-grams tables, which is quite
time-consuming.  In addition, if I want to allow search within a five word
window, for example, I then have to go back and create a new table with all
of the distinct 5-grams in the entire corpus, which also takes a long time.

================
NEW APPROACH
================

So what I’ve been trying to do is to use a new database that is composed of
all of the sequential words in the corpus.  For a 100 million word corpus,
for example, there would be 100 million rows with the two columns: ID
(offset value) and W1 (word), e.g.

1  I
2  saw
3  the
4 man
 . . .
100000000 xxx

To extract strings, I would then use self-joins on this one table, in which
[ID], [ID-1], [ID+1] etc are used to find preceding and following words,
e.g.:

select count(*),w1.w1,w2.w1,w3.w1 from
((select w1, ID+1 as ID from seq where w1 in ('le',’les’)) w1
inner join
(select w1, ID as ID from seq where w1 in (select w1 from x_c where x1 =
‘hacer’)) w2
on w2.ID = w1.ID)
inner join
(select w1, ID-1 as ID from seq where w1 in (select w1 from x_l where x1 =
'ver')) w3
on w3.ID=w1.ID
group by w1.w1,w2.w1,w3.w1

The will produce the same results set as the previous query (le hice decir,
les hagan comprar, etc).

================
THE PROBLEM
================

The problem is, the self-join solution is extremely slow.  I have a SQL
Server 7.0 database with a clustered index on W1 (sequential words) and a
normal index on ID, as well as clustered indexes on X1 (the lemma and part
of speech column) in the POS and lemma tables.  Even with all of this,
however, a self-join query like the one just listed takes about 15 seconds
on my machine (dual CPU 1.2GHz, 4GB RAM, three 10K rpm SCSI HD w/ RAID-0).
Using the n-grams approach, however, it takes about one or two seconds.

For all of you SQL gurus out there, am I doing something wrong in terms of
the database architecture of the table on which I do the self-join, or else
in the SQL syntax itself?

As far as I’m aware, this self-join solution -- with all of the words in the
corpus as sequential rows in the database -- is the one used by many other
large corpora.  But I’m guessing that they don’t take 15-20 seconds for a
query involving a three word string like [exact words] + [lemma for one
verb] + [POS, e.g. infinitives].  So they must be doing something different.

Any suggestions?  Thanks in advance.

Mark Davies
Illinois State University

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listserv.linguistlist.org/pipermail/corpora/attachments/20021022/f0982185/attachment.htm>


More information about the Corpora mailing list