[Corpora-List] Corpora and SQL

Patrick Drouin pat at traductik.org
Tue May 22 19:34:40 UTC 2007


Hello Tony,

Lars Nygaard has some very good suggestions. I'm adding a couple more 
ideas. You might want to create a few indexes to speed up the query, it 
is sometimes useful to use a tool to see what the database engine is 
doing with your query. Lookup the documentation for the word EXPLAIN, it 
will show you where the bottleneck is in you SELECT statement. If you're 
using mySQL, you can try NAVICAT.

I had a web-based concordancer based on mySQL and I used 2 tables 
instead of 1 in order to reduce the size of the tables. I had something 
that looked like this (I'm doing this from memory as the code is not 
handy at the moment) :


table CORPUS
+-----------+-------------+------+-----+---------+----------------+
| Field     | Type        | Null | Key | Default | Extra          |
+-----------+-------------+------+-----+---------+----------------+
| ID        | int(11)     | NO   | PRI | NULL    | auto_increment |
| token_ID  | int(11)     | NO  |      | NULL    |                |
+-----------+-------------+------+-----+---------+----------------+


table TOKENS
+----------+-------------+------+-----+---------+----------------+
| Field    | Type        | Null | Key | Default | Extra          |
+----------+-------------+------+-----+---------+----------------+
| token_ID | int(11)     | NO   | PRI | NULL    | auto_increment |
| word     | varchar(20) | NO   |     | NULL    |                |
| tag      | varchar(20) | NO   |     | NULL    |                |
| lemma    | varchar(20) | NO   |     | NULL    |                |
+----------+-------------+------+-----+---------+----------------+


Now, this will reduce the size of your main table but you might get a 
hit on the speed because you need to join the tables when you query 
(although I doubt it). If your indexes are created correctly, I believe 
this might be faster BUT it would have to be tested.

I also remember posts from Mark Davies on a similar subject, he seems to 
have found a good solution because his tool is quite quick.

Best regards,
Patrick



More information about the Corpora mailing list