[Corpora-List] Corpora and SQL
Lars Nygaard
lars.nygaard at iln.uio.no
Wed May 23 09:33:26 UTC 2007
Patrick Drouin wrote:
>
>
> table CORPUS
> +-----------+-------------+------+-----+---------+----------------+
> | Field     | Type        | Null | Key | Default | Extra          |
> +-----------+-------------+------+-----+---------+----------------+
> | ID        | int(11)     | NO   | PRI | NULL    | auto_increment |
> | token_ID  | int(11)     | NO   |     | NULL    |                |
> +-----------+-------------+------+-----+---------+----------------+
>
>
> table TOKENS
> +----------+-------------+------+-----+---------+----------------+
> | Field    | Type        | Null | Key | Default | Extra          |
> +----------+-------------+------+-----+---------+----------------+
> | token_ID | int(11)     | NO   | PRI | NULL    | auto_increment |
> | word     | varchar(20) | NO   |     | NULL    |                |
> | tag      | varchar(20) | NO   |     | NULL    |                |
> | lemma    | varchar(20) | NO   |     | NULL    |                |
> +----------+-------------+------+-----+---------+----------------+
>
>
> Now, this will reduce the size of your main table but you might get a
> hit on the speed because you need to join the tables when you query
> (although I doubt it). If your indexes are created correctly, I believe
> this might be faster BUT it would have to be tested.
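To make the two-table layout concrete, here is a minimal sketch of it. It uses Python's sqlite3 module in place of MySQL so that it runs anywhere; the helper names (build_two_table_corpus, positions_of) and the index name corpus_token_idx are my own assumptions for illustration, not part of the proposal above.

```python
import sqlite3


def build_two_table_corpus(tokens):
    """Load (word, tag, lemma) triples into an in-memory two-table layout.

    Hypothetical helper: TOKENS holds each distinct triple once,
    CORPUS holds one row per running token, pointing into TOKENS.
    """
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE TOKENS (
            token_ID INTEGER PRIMARY KEY AUTOINCREMENT,
            word     TEXT NOT NULL,
            tag      TEXT NOT NULL,
            lemma    TEXT NOT NULL,
            UNIQUE (word, tag, lemma)
        );
        CREATE TABLE CORPUS (
            ID       INTEGER PRIMARY KEY AUTOINCREMENT,
            token_ID INTEGER NOT NULL REFERENCES TOKENS (token_ID)
        );
        -- Index on the join column, so lookups by token do not scan CORPUS.
        CREATE INDEX corpus_token_idx ON CORPUS (token_ID);
        """
    )
    for word, tag, lemma in tokens:
        # Insert the triple only if it has not been seen before.
        conn.execute(
            "INSERT OR IGNORE INTO TOKENS (word, tag, lemma) VALUES (?, ?, ?)",
            (word, tag, lemma),
        )
        (token_id,) = conn.execute(
            "SELECT token_ID FROM TOKENS WHERE word = ? AND tag = ? AND lemma = ?",
            (word, tag, lemma),
        ).fetchone()
        conn.execute("INSERT INTO CORPUS (token_ID) VALUES (?)", (token_id,))
    conn.commit()
    return conn


def positions_of(conn, word):
    """Return the CORPUS.ID values (1-based positions) where `word` occurs."""
    rows = conn.execute(
        """
        SELECT c.ID
        FROM CORPUS AS c
        JOIN TOKENS AS t ON t.token_ID = c.token_ID
        WHERE t.word = ?
        ORDER BY c.ID
        """,
        (word,),
    )
    return [row[0] for row in rows]
```

Whether the join costs anything noticeable on a real corpus would, as said above, have to be measured; the sketch only shows the shape of the query.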
A third approach would be to use a single table with the token offsets
in a BLOB column:
+----------+-------------+------+-----+---------+----------------+
| Field    | Type        | Null | Key | Default | Extra          |
+----------+-------------+------+-----+---------+----------------+
| word     | varchar(20) | NO   |     | NULL    |                |
| tag      | varchar(20) | NO   |     | NULL    |                |
| lemma    | varchar(20) | NO   |     | NULL    |                |
| offsets  | BLOB        | NO   |     | NULL    |                |
+----------+-------------+------+-----+---------+----------------+
(The offsets column would then hold a sequence of integers, each giving
the distance of one occurrence from the start of the corpus.) This
would, I believe, be closer to the standard approach for text indexing
(cf. Witten/Moffat/Bell).
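
A rough sketch of the single-table idea, again in Python with sqlite3 standing in for MySQL. The table name token_index, the helper names, and the encoding (fixed-width 32-bit little-endian integers, 0-based offsets) are all assumptions made for illustration; a real index would more likely store compressed gaps between offsets.

```python
import sqlite3
import struct
from collections import defaultdict


def pack_offsets(offsets):
    """Encode non-negative ints as fixed-width little-endian 32-bit values."""
    return struct.pack(f"<{len(offsets)}I", *offsets)


def unpack_offsets(blob):
    """Decode a BLOB written by pack_offsets back into a list of ints."""
    return list(struct.unpack(f"<{len(blob) // 4}I", blob))


def build_index(tokens):
    """Build one row per distinct (word, tag, lemma) with its offset list."""
    postings = defaultdict(list)
    for offset, triple in enumerate(tokens):
        # offset = number of tokens between the corpus start and this one
        postings[triple].append(offset)

    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE token_index (
            word    TEXT NOT NULL,
            tag     TEXT NOT NULL,
            lemma   TEXT NOT NULL,
            offsets BLOB NOT NULL,
            PRIMARY KEY (word, tag, lemma)
        )
        """
    )
    conn.executemany(
        "INSERT INTO token_index VALUES (?, ?, ?, ?)",
        [(w, t, l, pack_offsets(offs)) for (w, t, l), offs in postings.items()],
    )
    conn.commit()
    return conn


def offsets_of(conn, word):
    """Return all corpus offsets of `word`, merged across tags and lemmas."""
    result = []
    query = "SELECT offsets FROM token_index WHERE word = ?"
    for (blob,) in conn.execute(query, (word,)):
        result.extend(unpack_offsets(blob))
    return sorted(result)
```

With this layout a lookup touches one row per distinct (word, tag, lemma) rather than one row per occurrence; the price is that the offsets have to be decoded in the application, not in SQL.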
best,
lars nygaard