[Corpora-List] Corpora and SQL
William Sakas
sakas at hunter.cuny.edu
Thu May 24 17:01:35 UTC 2007
Hi there,
As someone else has noted, these sorts of queries are notoriously slow in a
relational database system. As a general rule of thumb, when 'linear'
information about rows (e.g., the distance between rows) is needed to perform
a calculation, SQL is not the way to go, especially if the calculation can be
done in one pass through a file. It would be an easy program to write in Perl,
Python, etc. Any particular reason you're wedded to SQL?
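
To make the one-pass idea concrete, here is a rough Python sketch, not a
tested tool. It assumes the file has exactly the layout you quoted (four
fields separated by '|', one token per line), that no token is itself a '|',
and that matching should ignore case. The function names and the file name
in the usage note are made up for illustration.

```python
from collections import deque


def parse_line(line):
    """Split '| 11961 | Revogam | NN | <unknown> |' into (id, word, tag, lemma).

    Returns None for lines that do not have exactly four fields or whose
    first field is not an integer (assumed layout, see the note above).
    """
    fields = [f.strip() for f in line.strip().strip("|").split("|")]
    if len(fields) != 4:
        return None
    rec_id, word, tag, lemma = fields
    try:
        return int(rec_id), word, tag, lemma
    except ValueError:
        return None


def find_left_collocates(lines, node, collocate, distance):
    """Yield (collocate_id, node_id) pairs in a single pass over `lines`.

    A pair is reported when `node` occurs and the token exactly `distance`
    positions to its left is `collocate`. Only the last `distance` tokens
    are kept in memory, so the corpus size does not matter.
    """
    node, collocate = node.lower(), collocate.lower()
    window = deque(maxlen=distance)  # the `distance` most recent tokens
    for line in lines:
        rec = parse_line(line)
        if rec is None:
            continue  # skip blank or malformed lines
        rec_id, word, _tag, _lemma = rec
        # window[0] is the oldest token kept, i.e. `distance` to the left.
        if (word.lower() == node
                and len(window) == distance
                and window[0][1].lower() == collocate):
            yield window[0][0], rec_id
        window.append(rec)
```

You would call it on an open file, e.g. with open("corpus.txt",
encoding="utf-8") as f, then loop over find_left_collocates(f, "a", "como", 3).
Note that this sketch knows nothing about sentence boundaries, so a match can
span two sentences unless you reset the window at sentence-final punctuation.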
-- Wm
On 5/22/07, Tony Berber Sardinha <tony at corpuslg.org> wrote:
>
> Hi everyone
>
> I wonder if anyone has suggestions of ways of loading a large corpus
> (about 200 million words) in a SQL table and then searching this
> table for words and their collocates?
>
> The corpus is currently in text format, and looks like this (columns
> separated by '|' ):
>
> | 11961 | Revogam | NN | <unknown> |
> | 11962 | - | : | - |
> | 11963 | se | FW | se |
> | 11964 | as | IN | as |
> | 11965 | leis | NNS | <unknown> |
>
> where column 1 is the record id, column 2 is word, column 3 is tag,
> and column 4 is lemma.
>
> I could use a simple table structure like the one below:
>
> +-------+-------------+------+-----+---------+----------------+
> | Field | Type | Null | Key | Default | Extra |
> +-------+-------------+------+-----+---------+----------------+
> | ID | int(11) | NO | PRI | NULL | auto_increment |
> | word | varchar(20) | YES | | NULL | |
> | tag | varchar(20) | YES | | NULL | |
> | lemma | varchar(20) | YES | | NULL | |
> +-------+-------------+------+-----+---------+----------------+
>
> but I'm finding it hard to figure out how to search for collocates of
> 'word' in this table structure (for example where word = "a" and
> third collocate to the left = "como").
>
> Any ideas would be greatly appreciated.
>
> bye
>
> tony
>
--
William Gregory Sakas
Associate Professor of Computer Science and Linguistics
Hunter College and the Graduate Center
City University of New York (CUNY)
Email: sakas at hunter.cuny.edu
Voice: 1 212 772.5211
Fax: 1 212 772.5219