<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML><HEAD><TITLE>RE: [Corpora-List] Legal aspects of corpora compiling</TITLE>
<META http-equiv=Content-Type content="text/html; charset=iso-8859-1">
<META content="MSHTML 6.00.2715.400" name=GENERATOR></HEAD>
<BODY>
<DIV><FONT size=2>I'm experimenting with a change to the architecture of
a <SPAN class=390462403-24102002>100-million-word </SPAN>corpus of Spanish
that I've created (</FONT><A href="http://www.corpusdelespanol.org"><FONT
size=2>www.corpusdelespanol.org</FONT></A><FONT size=2>)<SPAN
class=390462403-24102002>. With my current solution, things work fairly
well. Nevertheless, for the reasons listed below, I'm considering a
change to a different database architecture, but the queries with the new approach
are too slow. </SPAN>Here's <SPAN class=390462403-24102002>an
overview:</SPAN></FONT></DIV>
<DIV><FONT size=2></FONT> </DIV>
<DIV><SPAN class=390462403-24102002><FONT size=2>
<DIV><SPAN class=390462403-24102002><FONT
size=2>================</FONT></SPAN></DIV>CURRENT APPROACH</FONT></SPAN></DIV>
<DIV><SPAN class=390462403-24102002><FONT size=2>
<DIV><SPAN class=390462403-24102002><FONT
size=2>================</FONT></SPAN></DIV>
<DIV><SPAN class=390462403-24102002></SPAN> </DIV></FONT></SPAN></DIV>
<DIV><FONT size=2>I use large databases of n-grams that contain all of the
distinct 1-, 2-, 3-, and 4-grams in the entire 100 million word corpus<SPAN
class=390462403-24102002>, along with their frequency in several historical
periods and registers of Modern Spanish</SPAN>. <SPAN
class=390462403-24102002>Using this approach, a</SPAN> typical query looks
like the following:</FONT></DIV>
<DIV><FONT size=2></FONT> </DIV>
<DIV><FONT size=2>select top 300 * from x3 where <BR>w1 in ('le', 'les') <BR>and
<BR>w2 in (select w1 from x_L where x1 = 'hacer') <BR>and <BR>w3 in (select w1
from x_c where x1 = 'v_inf') <BR>order by x19 DESC</FONT></DIV>
<DIV><FONT size=2></FONT> </DIV>
<DIV><FONT size=2>where w1,w2, and w3 are the three word slots in x3 (a table
containing all of the 40+ million distinct 3-grams in the corpus), and x_L and
x_c are two separate tables containing the lemma or part of speech (column x1)
for about 1,000,000 distinct forms (w1). <SPAN
class=390462403-24102002>(Note: to run this query, the user
submits [ le/les hacer.* *.v_inf ], without the
brackets.)</SPAN></FONT></DIV>
<DIV><FONT size=2></FONT> </DIV>
<DIV><FONT size=2>The preceding example, then, selects all of the distinct
three-word strings composed of [le or les] + [a form of hacer "to do"] + [an
infinitive], e.g. le hice decir, les hagan comprar, etc., and then sorts them by
their frequency in the 1900s (column x19).</FONT></DIV>
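<DIV><FONT size=2>As an illustration, here is a minimal, self-contained sketch of this n-gram architecture, in Python with SQLite rather than SQL Server. The table and column names (x3, x_L, x_c, w1-w3, x19) follow the description above, but the data is invented, and SQLite's LIMIT stands in for SQL Server's SELECT TOP:</FONT></DIV>

```python
import sqlite3

# Toy reproduction of the n-gram architecture described above.
# Table/column names follow the post; the data is invented.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE x3 (w1 TEXT, w2 TEXT, w3 TEXT, x19 INTEGER)")  # 3-grams + 1900s frequency
cur.execute("CREATE TABLE x_L (w1 TEXT, x1 TEXT)")  # form -> lemma
cur.execute("CREATE TABLE x_c (w1 TEXT, x1 TEXT)")  # form -> part of speech
cur.executemany("INSERT INTO x3 VALUES (?,?,?,?)", [
    ("le", "hice", "decir", 12),
    ("les", "hagan", "comprar", 5),
    ("le", "vio", "salir", 7),  # 'vio' is not a form of hacer: filtered out
])
cur.executemany("INSERT INTO x_L VALUES (?,?)", [("hice", "hacer"), ("hagan", "hacer")])
cur.executemany("INSERT INTO x_c VALUES (?,?)", [("decir", "v_inf"), ("comprar", "v_inf")])

# SQLite uses LIMIT instead of SQL Server's "select top 300"
rows = cur.execute("""
    SELECT w1, w2, w3, x19 FROM x3
    WHERE w1 IN ('le', 'les')
      AND w2 IN (SELECT w1 FROM x_L WHERE x1 = 'hacer')
      AND w3 IN (SELECT w1 FROM x_c WHERE x1 = 'v_inf')
    ORDER BY x19 DESC
    LIMIT 300
""").fetchall()
print(rows)  # [('le', 'hice', 'decir', 12), ('les', 'hagan', 'comprar', 5)]
```

<DIV><FONT size=2>The subqueries against x_L and x_c expand the lemma and POS constraints into lists of matching forms, exactly as in the full-scale query.</FONT></DIV>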
<DIV><FONT size=2></FONT> </DIV>
<DIV><FONT size=2>This approach has a number of advantages, including the
ability to search directly by frequency in a given sub-corpus (e.g. all of the
matching strings that occur more than three times in the 1900s, but never in
the 1700s or 1800s), and it is also very fast -- about one or two seconds
for <SPAN class=390462403-24102002>rather </SPAN>complicated searches like
this.</FONT></DIV>
<DIV><FONT size=2></FONT> </DIV>
<DIV><FONT size=2>The downside, and the reason I'm considering changing the
architecture of the database, is that every time I add or delete a text from the
corpus, I have to re-calculate all of the 1-, 2-, 3-, and 4-gram tables, which is
quite time-consuming. In addition, if I want to allow searching within a
five-word window, for example, I then have to go back and create a new table with
all of the distinct 5-grams in the entire corpus, which also takes a long
time.</FONT></DIV>
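<DIV><FONT size=2>To make concrete what that re-calculation involves, here is a toy sketch (the helper function is invented for illustration, not part of the actual system) of deriving n-gram frequency counts from a token stream -- a pass that has to cover the full 100 million tokens once per n-gram size, every time the corpus changes:</FONT></DIV>

```python
from collections import Counter

# Invented helper illustrating what rebuilding the n-gram tables entails:
# whenever a text is added or removed, counts like these must be
# regenerated over the whole token stream and reloaded into the tables.
def ngram_counts(tokens, n):
    """Count every distinct n-gram in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

tokens = ["le", "hice", "decir", "la", "verdad"]
print(ngram_counts(tokens, 3))
```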
<DIV><FONT size=2></FONT> </DIV>
<DIV><SPAN class=390462403-24102002><FONT
size=2>================</FONT></SPAN></DIV>
<DIV><SPAN class=390462403-24102002><FONT size=2>NEW
APPROACH</FONT></SPAN></DIV>
<DIV><SPAN class=390462403-24102002><FONT size=2>
<DIV><SPAN class=390462403-24102002><FONT
size=2>================</FONT></SPAN></DIV>
<DIV><SPAN class=390462403-24102002></SPAN> </DIV></FONT></SPAN></DIV>
<DIV><FONT size=2>So what I've been trying to do is to use a new database that
is composed of all of the sequential words in the corpus. For a 100
million word corpus, for example, there would be 100 million rows with two
columns: ID (offset value) and W1 (word), e.g.</FONT></DIV>
<DIV><FONT size=2></FONT> </DIV>
<DIV><FONT size=2>1 I<BR>2 saw<BR>3 the<BR>4 man<BR> . .
. <BR>100000000 xxx</FONT></DIV>
<DIV><FONT size=2></FONT> </DIV>
<DIV><FONT size=2>To extract strings, I would then use self-joins on this one
table, in which [ID], [ID-1], [ID+1] etc are used to find preceding and
following words, e.g.:</FONT></DIV>
<DIV><FONT size=2></FONT> </DIV>
<DIV><FONT size=2>select count(*),w1.w1,w2.w1,w3.w1 from<BR>((select w1, ID+1 as
ID from seq where w1 in ('le','les')) w1<BR>inner join<BR>(select w1, ID as ID
from seq where w1 in (select w1 from x_L where x1 = 'hacer')) w2<BR>on w2.ID =
w1.ID)<BR>inner join<BR>(select w1, ID-1 as ID from seq where w1 in (select w1
from x_c where x1 = 'v_inf')) w3<BR>on w3.ID=w1.ID<BR>group by
w1.w1,w2.w1,w3.w1</FONT></DIV>
<DIV><FONT size=2></FONT> </DIV>
<DIV><FONT size=2>This will produce the same result set as the previous query
(le hice decir, les hagan comprar, etc.).</FONT></DIV>
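<DIV><FONT size=2>Here is a runnable toy version of this approach, in Python with SQLite rather than SQL Server. It mirrors the offset-join logic above (the seq, x_L, and x_c tables follow the description in the post; the seven-token "corpus" is invented):</FONT></DIV>

```python
import sqlite3

# Working sketch of the self-join approach over a sequential-word table,
# in SQLite for portability (the post uses SQL Server 7.0). The data and
# alias names are invented for illustration.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE seq (ID INTEGER, w1 TEXT)")  # one row per running word
cur.execute("CREATE TABLE x_L (w1 TEXT, x1 TEXT)")     # form -> lemma
cur.execute("CREATE TABLE x_c (w1 TEXT, x1 TEXT)")     # form -> part of speech
cur.executemany("INSERT INTO seq VALUES (?,?)",
    enumerate(["le", "hice", "decir", "y", "les", "hagan", "comprar"], start=1))
cur.executemany("INSERT INTO x_L VALUES (?,?)", [("hice", "hacer"), ("hagan", "hacer")])
cur.executemany("INSERT INTO x_c VALUES (?,?)", [("decir", "v_inf"), ("comprar", "v_inf")])

# Shifting ID by +1 / -1 in the subqueries aligns the three word slots,
# so the equality joins require w2 right after w1 and w3 right after w2.
rows = cur.execute("""
    SELECT COUNT(*), a.w1, b.w1, c.w1
    FROM (SELECT w1, ID + 1 AS ID FROM seq WHERE w1 IN ('le', 'les')) a
    JOIN (SELECT w1, ID FROM seq
          WHERE w1 IN (SELECT w1 FROM x_L WHERE x1 = 'hacer')) b ON b.ID = a.ID
    JOIN (SELECT w1, ID - 1 AS ID FROM seq
          WHERE w1 IN (SELECT w1 FROM x_c WHERE x1 = 'v_inf')) c ON c.ID = a.ID
    GROUP BY a.w1, b.w1, c.w1
""").fetchall()
print(sorted(rows))  # [(1, 'le', 'hice', 'decir'), (1, 'les', 'hagan', 'comprar')]
```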
<DIV><FONT size=2></FONT> </DIV>
<DIV><FONT size=2>
<DIV><SPAN class=390462403-24102002><FONT
size=2>================</FONT></SPAN></DIV>
<DIV><SPAN class=390462403-24102002><FONT size=2>THE PROBLEM</FONT></SPAN></DIV>
<DIV><SPAN class=390462403-24102002><FONT size=2>
<DIV><SPAN class=390462403-24102002><FONT
size=2>================</FONT></SPAN></DIV>
<DIV><SPAN class=390462403-24102002></SPAN> </DIV></FONT></SPAN></DIV>The
problem is, the self-join solution is extremely slow. I have a SQL Server
7.0 database with a clustered index on W1 (sequential words) and a normal index
on ID, as well as clustered indexes on X1 (the lemma and part of speech column)
in the POS and lemma tables. </FONT><FONT size=2>Even with all of this,
however, a self-join query like the one just listed takes about 15 seconds on my
machine (dual CPU 1.2GHz, 4GB RAM, three 10K rpm SCSI HDs w/ RAID-0). With
the n-grams approach, by contrast, it takes about one or two seconds.</FONT></DIV>
<DIV><FONT size=2></FONT> </DIV>
<DIV><FONT size=2>For all of you SQL gurus out there, am I doing something wrong
in terms of the database architecture of the table on which I do the self-join,
or else in the SQL syntax itself? </FONT></DIV>
<DIV><FONT size=2></FONT> </DIV>
<DIV><FONT size=2>As far as I'm aware, this self-join solution -- with all of
the words in the corpus as sequential rows in the database -- is the one used by
many other large corpora. But I'm guessing that they don't take 15-20
seconds for a query involving a three-word string like [exact words] + [lemma
for one verb] + [POS, e.g. infinitives]. So they must be doing something
different.</FONT></DIV>
<DIV><FONT size=2></FONT> </DIV>
<DIV><FONT size=2>Any suggestions?<SPAN class=390462403-24102002> Thanks
in advance.</SPAN></FONT></DIV>
<DIV><FONT size=2><SPAN class=390462403-24102002></SPAN></FONT> </DIV>
<DIV><FONT size=2><SPAN class=390462403-24102002>Mark Davies</SPAN></FONT></DIV>
<DIV>
<DIV>
<DIV><SPAN class=390462403-24102002><FONT size=2>Illinois State
University</FONT></SPAN></DIV>
<DIV><FONT size=2></FONT> </DIV></DIV></DIV></BODY></HTML>