[Corpora-List] Corpora and SQL

William Sakas sakas at hunter.cuny.edu
Thu May 24 17:04:44 UTC 2007


Oops. Sorry about the last post; someone forwarded me the question "naked",
and after checking out the thread, it seems there are already a lot of very
good comments.

Best,
-- Wm

On 5/22/07, Tony Berber Sardinha <tony at corpuslg.org> wrote:
>
> Hi everyone
>
> I wonder if anyone has suggestions of ways of loading a large corpus
> (about 200 million words) in a SQL table and then searching this
> table for words and their collocates?
>
> The corpus is currently in text format, and looks like this (columns
> separated by '|' ):
>
> | 11961 | Revogam            | NN   | <unknown>     |
> | 11962 | -                  | :    | -             |
> | 11963 | se                 | FW   | se            |
> | 11964 | as                 | IN   | as            |
> | 11965 | leis               | NNS  | <unknown>     |
>
> where column 1 is the record id, column 2 is word, column 3 is tag,
> and column 4 is lemma.
>
> I could use a simple table structure like the one below:
>
> +-------+-------------+------+-----+---------+----------------+
> | Field | Type        | Null | Key | Default | Extra          |
> +-------+-------------+------+-----+---------+----------------+
> | ID    | int(11)     | NO   | PRI | NULL    | auto_increment |
> | word  | varchar(20) | YES  |     | NULL    |                |
> | tag   | varchar(20) | YES  |     | NULL    |                |
> | lemma | varchar(20) | YES  |     | NULL    |                |
> +-------+-------------+------+-----+---------+----------------+
>
> but I'm finding it hard to figure out how to search for collocates of
> 'word' in this table structure (for example where word = "a" and
> third collocate to the left = "como").
>
> Any ideas would be greatly appreciated.
>
> bye
>
> tony
>
>
>
>
>
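
For the collocate search asked about in the quoted message (word = "a" with
"como" three positions to the left), one possible approach is a self-join of
the table on the ID column. What follows is only a minimal sketch under some
loud assumptions: IDs are consecutive with no gaps and follow token order,
sentence and document boundaries are ignored, SQLite (via Python) stands in
for the MySQL table shown above, and the table name "corpus", the function
name collocate_hits and the toy rows are all made up for illustration.

```python
import sqlite3

# In-memory SQLite database standing in for the MySQL table in the question.
# The table name "corpus" is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE corpus ("
    "id INTEGER PRIMARY KEY, word TEXT, tag TEXT, lemma TEXT)"
)

# Toy rows, invented for illustration (not taken from the real corpus).
# Assumption: ids are consecutive and follow the order of tokens in the text.
tokens = ["como", "se", "fosse", "a", "lei", "e", "a", "casa"]
conn.executemany(
    "INSERT INTO corpus (id, word) VALUES (?, ?)",
    list(enumerate(tokens, start=1)),
)

# An index on word keeps the lookup of the node word from scanning the table.
conn.execute("CREATE INDEX idx_word ON corpus (word)")


def collocate_hits(conn, node, collocate, offset):
    """Return ids of rows whose word is `node` and whose neighbour `offset`
    positions away (negative = to the left) is `collocate`."""
    sql = (
        "SELECT n.id FROM corpus AS n "
        "JOIN corpus AS c ON c.id = n.id + ? "
        "WHERE n.word = ? AND c.word = ? "
        "ORDER BY n.id"
    )
    return [row[0] for row in conn.execute(sql, (offset, node, collocate))]


# word = "a" with "como" as the third token to the left
hits = collocate_hits(conn, "a", "como", -3)
print(hits)
```

The same JOIN should carry over to MySQL more or less unchanged. If the IDs
have gaps, or if collocates must not cross sentence boundaries, the join
condition would need an extra position or sentence column, which the table in
the question does not have.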


-- 
William Gregory Sakas
Associate Professor of Computer Science and Linguistics
Hunter College and the Graduate Center
City University of New York (CUNY)
Email:   sakas at hunter.cuny.edu
Voice:  1 212 772.5211
Fax:      1 212 772.5219