[Corpora-List] Corpora and SQL
Tony Berber Sardinha
tony at corpuslg.org
Tue May 22 16:30:46 UTC 2007
Hi everyone,
I wonder if anyone has suggestions for loading a large corpus
(about 200 million words) into an SQL table and then searching that
table for words and their collocates?
The corpus is currently in text format, and looks like this (columns
separated by '|' ):
| 11961 | Revogam | NN | <unknown> |
| 11962 | - | : | - |
| 11963 | se | FW | se |
| 11964 | as | IN | as |
| 11965 | leis | NNS | <unknown> |
where column 1 is the record id, column 2 is the word, column 3 is the
tag, and column 4 is the lemma.
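As a rough sketch of how lines in this format might be split into fields before loading, here is a small Python helper. The function name parse_line is made up for illustration, and it assumes that no token is itself a '|' character, which would need escaping or a different delimiter.

```python
def parse_line(line):
    """Split one '| id | word | tag | lemma |' line into a 4-tuple."""
    # Drop the outer pipes, then split on the three inner separators.
    # Assumption: the word, tag and lemma never contain '|' themselves.
    fields = [f.strip() for f in line.strip().strip("|").split("|")]
    if len(fields) != 4:
        raise ValueError(f"expected 4 columns, got {len(fields)}: {line!r}")
    rec_id, word, tag, lemma = fields
    return int(rec_id), word, tag, lemma
```

Raising an error on malformed lines, rather than skipping them silently, makes it easier to notice gaps in the record ids later on.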
I could use a simple table structure like the one below:
+-------+-------------+------+-----+---------+----------------+
| Field | Type        | Null | Key | Default | Extra          |
+-------+-------------+------+-----+---------+----------------+
| ID    | int(11)     | NO   | PRI | NULL    | auto_increment |
| word  | varchar(20) | YES  |     | NULL    |                |
| tag   | varchar(20) | YES  |     | NULL    |                |
| lemma | varchar(20) | YES  |     | NULL    |                |
+-------+-------------+------+-----+---------+----------------+
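For what it's worth, a table like this could be created and filled in batches roughly as follows. This sketch uses Python's built-in sqlite3 module as a stand-in for MySQL; the table name corpus, the index name, and the function names are all assumptions made for the example. The index on word is created only after loading, since building it up front tends to slow down bulk inserts.

```python
import sqlite3

def create_corpus_table(conn):
    # Same four columns as the structure above; SQLite types stand in
    # for MySQL's int(11) and varchar(20).
    conn.execute("""
        CREATE TABLE IF NOT EXISTS corpus (
            id    INTEGER PRIMARY KEY,
            word  TEXT,
            tag   TEXT,
            lemma TEXT
        )
    """)

def load_rows(conn, rows, batch_size=10000):
    """Insert (id, word, tag, lemma) tuples in batches; return the row count."""
    insert = "INSERT INTO corpus (id, word, tag, lemma) VALUES (?, ?, ?, ?)"
    batch, total = [], 0
    for row in rows:
        batch.append(row)
        if len(batch) >= batch_size:
            conn.executemany(insert, batch)
            total += len(batch)
            batch = []
    if batch:
        conn.executemany(insert, batch)
        total += len(batch)
    conn.commit()
    return total

def add_word_index(conn):
    # Lookups by word need an index; the primary key already covers id.
    conn.execute("CREATE INDEX IF NOT EXISTS idx_corpus_word ON corpus (word)")
```

Because rows is consumed as an iterator, the 200 million records never have to sit in memory at once.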
but I'm finding it hard to figure out how to search for collocates of
a word in this table structure (for example, where word = "a" and
the word three positions to its left = "como").
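One possible way to express this kind of search is to join the table to itself on an id offset. The sketch below uses Python's sqlite3 module rather than MySQL, and the table name corpus and the function name are made up for illustration. It assumes the ids are consecutive token positions with no gaps, and it ignores sentence and document boundaries, so treat it as a starting point only.

```python
import sqlite3

def find_with_left_collocate(conn, node, collocate, distance):
    """Return (node_id, collocate_word, node_word) for every occurrence of
    `node` that has `collocate` exactly `distance` positions to its left."""
    # Self-join: n is the node row, c is the row `distance` ids earlier.
    # Assumption: ids are gap-free token positions.
    query = """
        SELECT n.id, c.word, n.word
        FROM corpus AS n
        JOIN corpus AS c ON c.id = n.id - ?
        WHERE n.word = ? AND c.word = ?
        ORDER BY n.id
    """
    return conn.execute(query, (distance, node, collocate)).fetchall()

# Invented four-token sample, not taken from the real corpus.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE corpus (id INTEGER PRIMARY KEY, word TEXT, tag TEXT, lemma TEXT)"
)
conn.executemany(
    "INSERT INTO corpus (id, word) VALUES (?, ?)",
    [(1, "como"), (2, "se"), (3, "fosse"), (4, "a")],
)
hits = find_with_left_collocate(conn, "a", "como", 3)
```

On a table of this size the join is only practical if word is indexed, so that the database can narrow the rows down to occurrences of the node word before looking up each neighbour by primary key.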
Any ideas would be greatly appreciated.
bye
tony