[Corpora-List] Corpora and SQL
Lars Nygaard
lars.nygaard at iln.uio.no
Tue May 22 17:03:45 UTC 2007
Hi Tony,
This is a slightly tricky but solvable problem, depending in part on the
SQL engine you are using (I'm presupposing MySQL for the rest).
The simplest approach is to do a "self join":
select word from corpustable as a, corpustable as b where a.word = "a"
and b.word = "como" and a.ID = (b.ID+3);
This works, but slowly. It's better to use two temporary table
drop table temp_table_a;
drop table temp_table_b;
create temporary table temp_table_a (x int primary key) type=HEAP;
create temporary table temp_table_b (y int primary key) type=HEAP;
insert into temp_table_a(x) select ID+3 from corpustable where word = 'a';
insert into temp_table_b(y) select ID from corpustable where word =
'como';
select temp_table_a.x from temp_table_a,temp_table_b where
temp_table_a.x=temp_table_b.y;
(caveat: untested code ...)
Anyway, you should look into a corpus query engine (like CWB); it will
probably save you some trouble in the future.
I should also mention that professor Mark Davies has developed a corpus
query system based on MS SQL Server, but I'm not sure about
availablility of the source code.
best regards,
Lars Nygaard,
University of Oslo
Tony Berber Sardinha wrote:
> Hi everyone
>
> I wonder if anyone has suggestions of ways of loading a large corpus
> (about 200 million words) in a SQL table and then searching this table
> for words and their collocates?
>
> The corpus is currently in text format, and looks like this (columns
> separated by '|' ):
>
> | 11961 | Revogam | NN | <unknown> |
> | 11962 | - | : | - |
> | 11963 | se | FW | se |
> | 11964 | as | IN | as |
> | 11965 | leis | NNS | <unknown> |
>
> where column 1 is the record id, column 2 is word, column 3 is tag, and
> column 4 is lemma.
>
> I could use a simple table structure like the one below:
>
> +-------+-------------+------+-----+---------+----------------+
> | Field | Type | Null | Key | Default | Extra |
> +-------+-------------+------+-----+---------+----------------+
> | ID | int(11) | NO | PRI | NULL | auto_increment |
> | word | varchar(20) | YES | | NULL | |
> | tag | varchar(20) | YES | | NULL | |
> | lemma | varchar(20) | YES | | NULL | |
> +-------+-------------+------+-----+---------+----------------+
>
> but I'm finding it hard to figure out how to search for collocates of
> 'word' in this table structure (for example where word = "a" and third
> collocate to the left = "como").
>
> Any ideas would be greatly appreciated.
>
> bye
>
> tony
>
>
>
>
More information about the Corpora
mailing list