[Corpora-List] Corpora and SQL

William Sakas sakas at hunter.cuny.edu
Thu May 24 17:01:35 UTC 2007


Hi there,

As someone else has noted, these sorts of queries are notoriously slow in a
relational database system. As a rule of thumb, when 'linear' information
about rows (e.g., the distance between rows) is needed to perform a
calculation, SQL is not the way to go, especially if the calculation can be
done in one pass through a file. It would be an easy program to write in
Perl, Python, etc. Any particular reason you're wedded to SQL?
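
To make the one-pass idea concrete, here is a minimal Python sketch. It
assumes the pipe-delimited layout shown in the quoted message below (id,
word, tag, lemma) and that no token is itself a '|' character. The function
names, the SAMPLE data, and the choice of a fixed-size sliding window are
illustrative assumptions, not a tested tool for the actual corpus.

```python
from collections import deque


def parse_line(line):
    """Split '| id | word | tag | lemma |' into a tuple, or return None.

    Assumes no field contains a literal '|' (an assumption about the data).
    """
    fields = [f.strip() for f in line.strip().strip("|").split("|")]
    if len(fields) != 4 or not fields[0].isdigit():
        return None  # skip blank or malformed lines
    return int(fields[0]), fields[1], fields[2], fields[3]


def find_left_collocates(lines, node, collocate, distance):
    """Return ids of records whose word is `node` and whose word
    `distance` positions to the left is `collocate`, in a single pass."""
    window = deque(maxlen=distance)  # holds the previous `distance` words
    hits = []
    for line in lines:
        rec = parse_line(line)
        if rec is None:
            continue
        rec_id, word = rec[0], rec[1]
        # window[0] is the word exactly `distance` positions back,
        # but only once the window has filled up.
        if word == node and len(window) == distance and window[0] == collocate:
            hits.append(rec_id)
        window.append(word)
    return hits


# Sample rows copied from the quoted message (hypothetical test input).
SAMPLE = [
    "| 11961 | Revogam            | NN   | <unknown>     |",
    "| 11962 | -                  | :    | -             |",
    "| 11963 | se                 | FW   | se            |",
    "| 11964 | as                 | IN   | as            |",
    "| 11965 | leis               | NNS  | <unknown>     |",
]

if __name__ == "__main__":
    print(find_left_collocates(SAMPLE, "as", "Revogam", 3))
```

For the real corpus, one would pass an open file object instead of SAMPLE,
e.g. find_left_collocates(open(path, encoding="utf-8"), "a", "como", 3),
where path is whatever the corpus file is called. Memory use stays constant
because only the last few words are kept at any time.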

-- Wm

On 5/22/07, Tony Berber Sardinha <tony at corpuslg.org> wrote:
>
> Hi everyone
>
> I wonder if anyone has suggestions of ways of loading a large corpus
> (about 200 million words) in a SQL table and then searching this
> table for words and their collocates?
>
> The corpus is currently in text format, and looks like this (columns
> separated by '|' ):
>
> | 11961 | Revogam            | NN   | <unknown>     |
> | 11962 | -                  | :    | -             |
> | 11963 | se                 | FW   | se            |
> | 11964 | as                 | IN   | as            |
> | 11965 | leis               | NNS  | <unknown>     |
>
> where column 1 is the record id, column 2 is word, column 3 is tag,
> and column 4 is lemma.
>
> I could use a simple table structure like the one below:
>
> +-------+-------------+------+-----+---------+----------------+
> | Field | Type        | Null | Key | Default | Extra          |
> +-------+-------------+------+-----+---------+----------------+
> | ID    | int(11)     | NO   | PRI | NULL    | auto_increment |
> | word  | varchar(20) | YES  |     | NULL    |                |
> | tag   | varchar(20) | YES  |     | NULL    |                |
> | lemma | varchar(20) | YES  |     | NULL    |                |
> +-------+-------------+------+-----+---------+----------------+
>
> but I'm finding it hard to figure out how to search for collocates of
> 'word' in this table structure (for example where word = "a" and
> third collocate to the left = "como").
>
> Any ideas would be greatly appreciated.
>
> bye
>
> tony


-- 
William Gregory Sakas
Associate Professor of Computer Science and Linguistics
Hunter College and the Graduate Center
City University of New York (CUNY)
Email:   sakas at hunter.cuny.edu
Voice:  1 212 772.5211
Fax:      1 212 772.5219