[Corpora-List] Corpora and SQL

Wed May 23 09:33:26 UTC 2007

Patrick Drouin wrote:

> 
> 
> table CORPUS
> +-----------+-------------+------+-----+---------+----------------+
> | Field     | Type        | Null | Key | Default | Extra          |
> +-----------+-------------+------+-----+---------+----------------+
> | ID        | int(11)     | NO   | PRI | NULL    | auto_increment |
> | token_ID  | int(11)     | NO   |     | NULL    |                |
> +-----------+-------------+------+-----+---------+----------------+
> 
> 
> table TOKENS
> +----------+-------------+------+-----+---------+----------------+
> | Field    | Type        | Null | Key | Default | Extra          |
> +----------+-------------+------+-----+---------+----------------+
> | token_ID | int(11)     | NO   | PRI | NULL    | auto_increment |
> | word     | varchar(20) | NO   |     | NULL    |                |
> | tag      | varchar(20) | NO   |     | NULL    |                |
> | lemma    | varchar(20) | NO   |     | NULL    |                |
> +----------+-------------+------+-----+---------+----------------+
> 
> 
> Now, this will reduce the size of your main table but you might get a 
> hit on the speed because you need to join the tables when you query 
> (although I doubt it). If your indexes are created correctly, I believe 
> this might be faster BUT it would have to be tested.
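
For concreteness, the two-table layout quoted above could be exercised
along these lines. This is only a rough sketch: it uses Python with an
in-memory SQLite database instead of MySQL, and the UNIQUE constraint,
the index name corpus_token_idx, the helper functions and the sample
sentence are all my own assumptions, not part of the original proposal.

```python
import sqlite3

# In-memory SQLite stands in for MySQL here (an assumption for portability).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE TOKENS (
        token_ID INTEGER PRIMARY KEY AUTOINCREMENT,
        word     VARCHAR(20) NOT NULL,
        tag      VARCHAR(20) NOT NULL,
        lemma    VARCHAR(20) NOT NULL,
        UNIQUE (word, tag, lemma)          -- assumed: one row per distinct token
    );
    CREATE TABLE CORPUS (
        ID       INTEGER PRIMARY KEY AUTOINCREMENT,
        token_ID INTEGER NOT NULL REFERENCES TOKENS(token_ID)
    );
    -- assumed index so the join does not scan the whole CORPUS table
    CREATE INDEX corpus_token_idx ON CORPUS (token_ID);
""")

def add_token(word, tag, lemma):
    """Append one running token to CORPUS, reusing its TOKENS row if present."""
    conn.execute(
        "INSERT OR IGNORE INTO TOKENS (word, tag, lemma) VALUES (?, ?, ?)",
        (word, tag, lemma),
    )
    (token_id,) = conn.execute(
        "SELECT token_ID FROM TOKENS WHERE word = ? AND tag = ? AND lemma = ?",
        (word, tag, lemma),
    ).fetchone()
    conn.execute("INSERT INTO CORPUS (token_ID) VALUES (?)", (token_id,))

# Hypothetical sample data, purely for illustration.
sample = [
    ("the", "DT", "the"),
    ("cat", "NN", "cat"),
    ("saw", "VBD", "see"),
    ("the", "DT", "the"),
    ("dog", "NN", "dog"),
]
for token in sample:
    add_token(*token)

def positions_of_lemma(lemma):
    """Return the CORPUS positions (IDs) at which a lemma occurs."""
    rows = conn.execute(
        "SELECT c.ID FROM CORPUS c "
        "JOIN TOKENS t ON t.token_ID = c.token_ID "
        "WHERE t.lemma = ? ORDER BY c.ID",
        (lemma,),
    )
    return [row[0] for row in rows]
```

Calling positions_of_lemma("the") walks the join from TOKENS to CORPUS;
whether that is faster than a single wide table would still have to be
measured on real data.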

A third approach would be to use a single table with token offsets in a 
BLOB column:

+----------+-------------+------+-----+---------+----------------+
| Field    | Type        | Null | Key | Default | Extra          |
+----------+-------------+------+-----+---------+----------------+
| word     | varchar(20) | NO   |     | NULL    |                |
| tag      | varchar(20) | NO   |     | NULL    |                |
| lemma    | varchar(20) | NO   |     | NULL    |                |
| offsets  | BLOB        | NO   |     | NULL    |                |
+----------+-------------+------+-----+---------+----------------+

(The offsets column would then consist of a sequence of integers, each 
giving the distance of one occurrence from the start of the corpus.) 
This would, I believe, be closer to the standard approach for text 
indexing (cf. Witten/Moffat/Bell).
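
A minimal sketch of how the offsets column might be filled and read
back, again in Python with SQLite rather than MySQL. The table name
POSTINGS, the fixed-width 32-bit little-endian encoding, the helper
names and the sample sentence are assumptions made for illustration;
a real index would likely choose a more compact encoding.

```python
import sqlite3
import struct
from collections import defaultdict

def pack_offsets(offsets):
    """Pack a list of offsets as little-endian unsigned 32-bit integers."""
    return struct.pack(f"<{len(offsets)}I", *offsets)

def unpack_offsets(blob):
    """Inverse of pack_offsets: decode a BLOB back into a list of ints."""
    return list(struct.unpack(f"<{len(blob) // 4}I", blob))

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE POSTINGS (              -- hypothetical table name
        word    VARCHAR(20) NOT NULL,
        tag     VARCHAR(20) NOT NULL,
        lemma   VARCHAR(20) NOT NULL,
        offsets BLOB        NOT NULL
    )
""")

# Hypothetical sample data; the offset is the token's 0-based position.
sample = [
    ("the", "DT", "the"),
    ("cat", "NN", "cat"),
    ("saw", "VBD", "see"),
    ("the", "DT", "the"),
    ("dog", "NN", "dog"),
]

# Collect all offsets per distinct (word, tag, lemma) triple.
postings = defaultdict(list)
for offset, token in enumerate(sample):
    postings[token].append(offset)

# One row per distinct triple, with its occurrences packed into the BLOB.
for (word, tag, lemma), offsets in postings.items():
    conn.execute(
        "INSERT INTO POSTINGS (word, tag, lemma, offsets) VALUES (?, ?, ?, ?)",
        (word, tag, lemma, pack_offsets(offsets)),
    )

def offsets_of_word(word):
    """Return every corpus offset at which a word form occurs, in order."""
    result = []
    query = "SELECT offsets FROM POSTINGS WHERE word = ?"
    for (blob,) in conn.execute(query, (word,)):
        result.extend(unpack_offsets(blob))
    return sorted(result)
```

With this layout a lookup such as offsets_of_word("the") reads a single
row and decodes its BLOB, so no join is needed; the cost moves to
decoding the offsets in the application.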

best,
lars nygaard