<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML><HEAD><TITLE>RE: [Corpora-List] Legal aspects of corpora compiling</TITLE>
<META http-equiv=Content-Type content="text/html; charset=iso-8859-1">
<META content="MSHTML 6.00.2715.400" name=GENERATOR></HEAD>
<BODY>
<DIV><FONT size=2>I'm experimenting with a change to the architecture of
a <SPAN class=390462403-24102002>100-million-word </SPAN>corpus of Spanish
that I've created (</FONT><A href="http://www.corpusdelespanol.org"><FONT
size=2>www.corpusdelespanol.org</FONT></A><FONT size=2>)<SPAN
class=390462403-24102002>. With my current solution, things work fairly
well. Nevertheless, for the reasons listed below, I'm considering a
change to a different database architecture, but the queries with the new approach
are too slow. </SPAN>Here's <SPAN class=390462403-24102002>an
overview:</SPAN></FONT></DIV>
<DIV><FONT size=2></FONT> </DIV>
<DIV><SPAN class=390462403-24102002><FONT size=2>
<DIV><SPAN class=390462403-24102002><FONT
size=2>================</FONT></SPAN></DIV>CURRENT APPROACH</FONT></SPAN></DIV>
<DIV><SPAN class=390462403-24102002><FONT size=2>
<DIV><SPAN class=390462403-24102002><FONT
size=2>================</FONT></SPAN></DIV>
<DIV><SPAN class=390462403-24102002></SPAN> </DIV></FONT></SPAN></DIV>
<DIV><FONT size=2>I use large databases of n-grams that contain all of the
distinct 1-, 2-, 3-, and 4-grams in the entire 100 million word corpus<SPAN
class=390462403-24102002>, along with their frequency in several historical
periods and registers of Modern Spanish</SPAN>. <SPAN
class=390462403-24102002>Using this approach, a</SPAN> typical query looks
like the following:</FONT></DIV>
<DIV><FONT size=2></FONT> </DIV>
<DIV><FONT size=2>select top 300 * from x3 where <BR>w1 in ('le', 'les') <BR>and
<BR>w2 in (select w1 from x_L where x1 = 'hacer') <BR>and <BR>w3 in (select w1
from x_c where x1 = 'v_inf') <BR>order by x19 DESC</FONT></DIV>
<DIV><FONT size=2></FONT> </DIV>
<DIV><FONT size=2>where w1,w2, and w3 are the three word slots in x3 (a table
containing all of the 40+ million distinct 3-grams in the corpus), and x_L and
x_c are two separate tables containing the lemma or part of speech (column x1)
for about 1,000,000 distinct forms (w1). <SPAN
class=390462403-24102002>(Note: to run this query, the user
submits [ le/les hacer.* *.v_inf ], without the
brackets.)</SPAN></FONT></DIV>
<DIV><FONT size=2></FONT> </DIV>
<DIV><FONT size=2>The preceding example, then, selects all of the distinct
three-word strings composed of [le or les] + [a form of hacer "to do"] + [an
infinitive], e.g. le hice decir, les hagan comprar, etc., and then sorts them by
their frequency in the 1900s (column x19).</FONT></DIV>
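<DIV><FONT size=2>As an illustration, here is a minimal, self-contained sketch of this n-gram architecture, in Python with SQLite rather than SQL Server. The table and column names (x3, x_L, x_c, w1-w3, x19) follow the description above, but the data is invented, and SQLite's LIMIT stands in for SQL Server's SELECT TOP:</FONT></DIV>

```python
import sqlite3

# Toy reproduction of the n-gram architecture described above.
# Table/column names follow the post; the data is invented.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE x3 (w1 TEXT, w2 TEXT, w3 TEXT, x19 INTEGER)")  # 3-grams + 1900s frequency
cur.execute("CREATE TABLE x_L (w1 TEXT, x1 TEXT)")  # form -> lemma
cur.execute("CREATE TABLE x_c (w1 TEXT, x1 TEXT)")  # form -> part of speech
cur.executemany("INSERT INTO x3 VALUES (?,?,?,?)", [
    ("le", "hice", "decir", 12),
    ("les", "hagan", "comprar", 5),
    ("le", "vio", "salir", 7),  # 'vio' is not a form of hacer: filtered out
])
cur.executemany("INSERT INTO x_L VALUES (?,?)", [("hice", "hacer"), ("hagan", "hacer")])
cur.executemany("INSERT INTO x_c VALUES (?,?)", [("decir", "v_inf"), ("comprar", "v_inf")])

# SQLite uses LIMIT instead of SQL Server's "select top 300"
rows = cur.execute("""
    SELECT w1, w2, w3, x19 FROM x3
    WHERE w1 IN ('le', 'les')
      AND w2 IN (SELECT w1 FROM x_L WHERE x1 = 'hacer')
      AND w3 IN (SELECT w1 FROM x_c WHERE x1 = 'v_inf')
    ORDER BY x19 DESC
    LIMIT 300
""").fetchall()
print(rows)  # [('le', 'hice', 'decir', 12), ('les', 'hagan', 'comprar', 5)]
```

<DIV><FONT size=2>The subqueries against x_L and x_c expand the lemma and POS constraints into lists of matching forms, exactly as in the full-scale query.</FONT></DIV>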
<DIV><FONT size=2></FONT> </DIV>
<DIV><FONT size=2>This approach has a number of advantages, including the
ability to search directly by frequency in a given sub-corpus (e.g. all of the
matching strings that occur more than three times in the 1900s, but never in
the 1700s or 1800s), and it is also very fast -- about one or two seconds
for <SPAN class=390462403-24102002>rather </SPAN>complicated searches like
this.</FONT></DIV>
<DIV><FONT size=2></FONT> </DIV>
<DIV><FONT size=2>The downside, and the reason I'm considering changing the
architecture of the database, is that every time I add or delete a text from the
corpus, I have to re-calculate all of the 1-, 2-, 3-, and 4-gram tables, which is
quite time-consuming. In addition, if I want to allow searching within a
five-word window, for example, I then have to go back and create a new table with
all of the distinct 5-grams in the entire corpus, which also takes a long
time.</FONT></DIV>
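<DIV><FONT size=2>To make concrete what that re-calculation involves, here is a toy sketch (the helper function is invented for illustration, not part of the actual system) of deriving n-gram frequency counts from a token stream -- a pass that has to cover the full 100 million tokens once per n-gram size, every time the corpus changes:</FONT></DIV>

```python
from collections import Counter

# Invented helper illustrating what rebuilding the n-gram tables entails:
# whenever a text is added or removed, counts like these must be
# regenerated over the whole token stream and reloaded into the tables.
def ngram_counts(tokens, n):
    """Count every distinct n-gram in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

tokens = ["le", "hice", "decir", "la", "verdad"]
print(ngram_counts(tokens, 3))
```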
<DIV><FONT size=2></FONT> </DIV>
<DIV><SPAN class=390462403-24102002><FONT
size=2>================</FONT></SPAN></DIV>
<DIV><SPAN class=390462403-24102002><FONT size=2>NEW
APPROACH</FONT></SPAN></DIV>
<DIV><SPAN class=390462403-24102002><FONT size=2>
<DIV><SPAN class=390462403-24102002><FONT
size=2>================</FONT></SPAN></DIV>
<DIV><SPAN class=390462403-24102002></SPAN> </DIV></FONT></SPAN></DIV>
<DIV><FONT size=2>So what I've been trying to do is to use a new database that
is composed of all of the sequential words in the corpus. For a 100
million word corpus, for example, there would be 100 million rows with two
columns: ID (offset value) and W1 (word), e.g.</FONT></DIV>
<DIV><FONT size=2></FONT> </DIV>
<DIV><FONT size=2>1 I<BR>2 saw<BR>3 the<BR>4 man<BR> . .
. <BR>100000000 xxx</FONT></DIV>
<DIV><FONT size=2></FONT> </DIV>
<DIV><FONT size=2>To extract strings, I would then use self-joins on this one
table, in which [ID], [ID-1], [ID+1] etc are used to find preceding and
following words, e.g.:</FONT></DIV>
<DIV><FONT size=2></FONT> </DIV>
<DIV><FONT size=2>select count(*),w1.w1,w2.w1,w3.w1 from<BR>((select w1, ID+1 as
ID from seq where w1 in ('le','les')) w1<BR>inner join<BR>(select w1, ID as ID
from seq where w1 in (select w1 from x_L where x1 = 'hacer')) w2<BR>on w2.ID =
w1.ID)<BR>inner join<BR>(select w1, ID-1 as ID from seq where w1 in (select w1
from x_c where x1 = 'v_inf')) w3<BR>on w3.ID=w1.ID<BR>group by
w1.w1,w2.w1,w3.w1</FONT></DIV>
<DIV><FONT size=2></FONT> </DIV>
<DIV><FONT size=2>This will produce the same result set as the previous query
(le hice decir, les hagan comprar, etc.).</FONT></DIV>
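<DIV><FONT size=2>Here is a runnable toy version of this approach, in Python with SQLite rather than SQL Server. It mirrors the offset-join logic above (the seq, x_L, and x_c tables follow the description in the post; the seven-token "corpus" is invented):</FONT></DIV>

```python
import sqlite3

# Working sketch of the self-join approach over a sequential-word table,
# in SQLite for portability (the post uses SQL Server 7.0). The data and
# alias names are invented for illustration.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE seq (ID INTEGER, w1 TEXT)")  # one row per running word
cur.execute("CREATE TABLE x_L (w1 TEXT, x1 TEXT)")     # form -> lemma
cur.execute("CREATE TABLE x_c (w1 TEXT, x1 TEXT)")     # form -> part of speech
cur.executemany("INSERT INTO seq VALUES (?,?)",
    enumerate(["le", "hice", "decir", "y", "les", "hagan", "comprar"], start=1))
cur.executemany("INSERT INTO x_L VALUES (?,?)", [("hice", "hacer"), ("hagan", "hacer")])
cur.executemany("INSERT INTO x_c VALUES (?,?)", [("decir", "v_inf"), ("comprar", "v_inf")])

# Shifting ID by +1 / -1 in the subqueries aligns the three word slots,
# so the equality joins require w2 right after w1 and w3 right after w2.
rows = cur.execute("""
    SELECT COUNT(*), a.w1, b.w1, c.w1
    FROM (SELECT w1, ID + 1 AS ID FROM seq WHERE w1 IN ('le', 'les')) a
    JOIN (SELECT w1, ID FROM seq
          WHERE w1 IN (SELECT w1 FROM x_L WHERE x1 = 'hacer')) b ON b.ID = a.ID
    JOIN (SELECT w1, ID - 1 AS ID FROM seq
          WHERE w1 IN (SELECT w1 FROM x_c WHERE x1 = 'v_inf')) c ON c.ID = a.ID
    GROUP BY a.w1, b.w1, c.w1
""").fetchall()
print(sorted(rows))  # [(1, 'le', 'hice', 'decir'), (1, 'les', 'hagan', 'comprar')]
```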
<DIV><FONT size=2></FONT> </DIV>
<DIV><FONT size=2>
<DIV><SPAN class=390462403-24102002><FONT
size=2>================</FONT></SPAN></DIV>
<DIV><SPAN class=390462403-24102002><FONT size=2>THE PROBLEM</FONT></SPAN></DIV>
<DIV><SPAN class=390462403-24102002><FONT size=2>
<DIV><SPAN class=390462403-24102002><FONT
size=2>================</FONT></SPAN></DIV>
<DIV><SPAN class=390462403-24102002></SPAN> </DIV></FONT></SPAN></DIV>The
problem is, the self-join solution is extremely slow. I have a SQL Server
7.0 database with a clustered index on W1 (sequential words) and a normal index
on ID, as well as clustered indexes on X1 (the lemma and part of speech column)
in the POS and lemma tables. </FONT><FONT size=2>Even with all of this,
however, a self-join query like the one just listed takes about 15 seconds on my
machine (dual CPU 1.2GHz, 4GB RAM, three 10K rpm SCSI HDs w/ RAID-0). With
the n-grams approach, by contrast, it takes about one or two seconds.</FONT></DIV>
<DIV><FONT size=2></FONT> </DIV>
<DIV><FONT size=2>For all of you SQL gurus out there, am I doing something wrong
in terms of the database architecture of the table on which I do the self-join,
or else in the SQL syntax itself? </FONT></DIV>
<DIV><FONT size=2></FONT> </DIV>
<DIV><FONT size=2>As far as I'm aware, this self-join solution -- with all of
the words in the corpus as sequential rows in the database -- is the one used by
many other large corpora. But I'm guessing that they don't take 15-20
seconds for a query involving a three-word string like [exact words] + [lemma
for one verb] + [POS, e.g. infinitives]. So they must be doing something
different.</FONT></DIV>
<DIV><FONT size=2></FONT> </DIV>
<DIV><FONT size=2>Any suggestions?<SPAN class=390462403-24102002> Thanks
in advance.</SPAN></FONT></DIV>
<DIV><FONT size=2><SPAN class=390462403-24102002></SPAN></FONT> </DIV>
<DIV><FONT size=2><SPAN class=390462403-24102002>Mark Davies</SPAN></FONT></DIV>
<DIV>
<DIV>
<DIV><SPAN class=390462403-24102002><FONT size=2>Illinois State
University</FONT></SPAN></DIV>
<DIV><FONT size=2></FONT> </DIV></DIV></DIV></BODY></HTML>