[Corpora-List] Corpora and SQL

Tue May 22 16:30:46 UTC 2007

Hi everyone,

Does anyone have suggestions for loading a large corpus
(about 200 million words) into an SQL table and then searching that
table for words and their collocates?

The corpus is currently in text format, and looks like this (columns
separated by '|'):

| 11961 | Revogam            | NN   | <unknown>     |
| 11962 | -                  | :    | -             |
| 11963 | se                 | FW   | se            |
| 11964 | as                 | IN   | as            |
| 11965 | leis               | NNS  | <unknown>     |

where column 1 is the record id, column 2 is the word, column 3 is the
tag, and column 4 is the lemma.
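
To make the format concrete, here is a minimal Python sketch of how one of these lines might be split into its four fields before loading. The function name parse_line is hypothetical, and the sketch assumes that no token is itself a '|' character; if the corpus contains such tokens, the split would need to be smarter.

```python
def parse_line(line):
    """Split one '|'-delimited corpus line into (id, word, tag, lemma)."""
    # Drop the outer pipes, then split on the inner ones and trim the padding.
    # Assumption: the word, tag and lemma never contain a '|' themselves.
    fields = [f.strip() for f in line.strip().strip("|").split("|")]
    if len(fields) != 4:
        raise ValueError(f"expected 4 columns, got {len(fields)}: {line!r}")
    rec_id, word, tag, lemma = fields
    return int(rec_id), word, tag, lemma


# The five sample lines shown above.
sample = """
| 11961 | Revogam            | NN   | <unknown>     |
| 11962 | -                  | :    | -             |
| 11963 | se                 | FW   | se            |
| 11964 | as                 | IN   | as            |
| 11965 | leis               | NNS  | <unknown>     |
"""

rows = [parse_line(l) for l in sample.splitlines() if l.strip()]
```

Each element of rows is then a tuple that could be passed to a parameterised INSERT statement.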

I could use a simple table structure like the one below:

+-------+-------------+------+-----+---------+----------------+
| Field | Type        | Null | Key | Default | Extra          |
+-------+-------------+------+-----+---------+----------------+
| ID    | int(11)     | NO   | PRI | NULL    | auto_increment |
| word  | varchar(20) | YES  |     | NULL    |                |
| tag   | varchar(20) | YES  |     | NULL    |                |
| lemma | varchar(20) | YES  |     | NULL    |                |
+-------+-------------+------+-----+---------+----------------+

but I'm finding it hard to figure out how to search for collocates of
'word' in this table structure (for example, rows where word = "a" and
the word three positions to its left = "como").
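
One direction I have been wondering about is joining the table to itself on an ID offset. The sketch below is only an illustration under several assumptions: it uses Python's built-in sqlite3 module as a stand-in for the real SQL server, the table name corpus and the function find_collocate are hypothetical, the toy tokens are invented for the example, and it only works if the IDs are consecutive with no gaps and follow the running order of the text.

```python
import sqlite3

# In-memory SQLite database as a stand-in for the real server (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE corpus (id INTEGER PRIMARY KEY, word TEXT, tag TEXT, lemma TEXT)"
)
# An index on word keeps the lookup of the node word from scanning the table.
conn.execute("CREATE INDEX idx_word ON corpus (word)")

# Invented toy tokens, numbered 1..8 in text order (not from the real corpus).
tokens = ["como", "se", "fosse", "a", "lei", "e", "a", "casa"]
conn.executemany(
    "INSERT INTO corpus (id, word) VALUES (?, ?)",
    list(enumerate(tokens, start=1)),
)


def find_collocate(conn, node, collocate, offset):
    """Return (node id, collocate, node) rows where `collocate` sits
    `offset` positions from `node` (negative = left, positive = right).
    Relies on IDs being gap-free and in text order."""
    sql = """
        SELECT n.id, c.word, n.word
        FROM corpus AS n
        JOIN corpus AS c ON c.id = n.id + ?
        WHERE n.word = ? AND c.word = ?
        ORDER BY n.id
    """
    return conn.execute(sql, (offset, node, collocate)).fetchall()


# word = "a" with "como" three positions to the left
hits = find_collocate(conn, "a", "como", -3)
```

I have not tried this at the scale of 200 million rows, so I do not know whether such a join would be fast enough in practice.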

Any ideas would be greatly appreciated.

bye

tony