[Corpora-List] Corpora and SQL

Tue May 22 17:03:45 UTC 2007

Hi Tony,

This is a slightly tricky but solvable problem, depending in part on the
SQL engine you are using (I'm presupposing MySQL for the rest).

The simplest approach is to do a "self join":

  select word from corpustable as a, corpustable as b where a.word = "a"
and b.word = "como" and a.ID = (b.ID+3);

This works, but slowly. It's better to use two temporary table

  drop table temp_table_a;
  drop table temp_table_b;

  create temporary table temp_table_a (x int primary key) type=HEAP;
  create temporary table temp_table_b (y int primary key) type=HEAP;

  insert into temp_table_a(x) select ID+3 from corpustable where word = 'a';

  insert into temp_table_b(y) select ID from corpustable where word =
'como';

  select temp_table_a.x from temp_table_a,temp_table_b where
temp_table_a.x=temp_table_b.y;

(caveat: untested code ...)

Anyway, you should look into a corpus query engine (like CWB); it will
probably save you some trouble in the future.

I should also mention that professor Mark Davies has developed a corpus
query system based on MS SQL Server, but I'm not sure about
availablility of the source code.

best regards,

Lars Nygaard,
University of Oslo

Tony Berber Sardinha wrote:
> Hi everyone
> 
> I wonder if anyone has suggestions of ways of loading a large corpus  
> (about 200 million words) in a SQL table and then searching this  table 
> for words and their collocates?
> 
> The corpus is currently in text format, and looks like this (columns  
> separated by '|' ):
> 
> | 11961 | Revogam            | NN   | <unknown>     |
> | 11962 | -                  | :    | -             |
> | 11963 | se                 | FW   | se            |
> | 11964 | as                 | IN   | as            |
> | 11965 | leis        | NNS  | <unknown>     |
> 
> where column 1 is the record id, column 2 is word, column 3 is tag,  and 
> column 4 is lemma.
> 
> I could use a simple table structure like the one below:
> 
> +-------+-------------+------+-----+---------+----------------+
> | Field | Type        | Null | Key | Default | Extra          |
> +-------+-------------+------+-----+---------+----------------+
> | ID    | int(11)     | NO   | PRI | NULL    | auto_increment |
> | word  | varchar(20) | YES  |     | NULL    |                |
> | tag   | varchar(20) | YES  |     | NULL    |                |
> | lemma | varchar(20) | YES  |     | NULL    |                |
> +-------+-------------+------+-----+---------+----------------+
> 
> but I'm finding it hard to figure out how to search for collocates of  
> 'word' in this table structure (for example where word = "a" and  third 
> collocate to the left = "como").
> 
> Any ideas would be greatly appreciated.
> 
> bye
> 
> tony
> 
> 
> 
>