Hi there,<br><br>As someone else has noted, these sorts of queries are notoriously slow in a<br>relational database system. As a general rule of thumb, when 'linear' information<br>about rows (e.g., distance between rows) is needed to perform a calculation, SQL
<br>is not the way to go, especially if the calculation can be done in one pass<br>through a file. It would be an easy program to write in Perl, Python, etc. <br>Any particular reason you're wedded to SQL?<br><br>-- Wm
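<br><br>For illustration, here is a minimal sketch of the one-pass idea in Python. It assumes the corpus file has one token per line in the '|'-separated layout quoted below, and that no word itself contains '|'. The function names (parse_line, left_collocate_hits) are hypothetical, made up for this example.<br><br>

```python
from collections import deque


def parse_line(line):
    """Split one '|'-separated corpus line into (id, word, tag, lemma).

    Returns None for lines that do not have exactly four fields
    or whose first field is not a numeric record id.
    """
    fields = [f.strip() for f in line.strip().strip("|").split("|")]
    if len(fields) != 4 or not fields[0].isdigit():
        return None
    rec_id, word, tag, lemma = fields
    return int(rec_id), word, tag, lemma


def left_collocate_hits(lines, word, collocate, distance):
    """Yield the record id of every `word` that has `collocate`
    exactly `distance` tokens to its left, in a single pass."""
    if distance < 1:
        raise ValueError("distance must be at least 1")
    # Sliding window holding the last `distance` words seen so far.
    window = deque(maxlen=distance)
    for line in lines:
        rec = parse_line(line)
        if rec is None:
            continue  # skip malformed or blank lines
        rec_id, token, _tag, _lemma = rec
        # When the window is full, window[0] is the word that sits
        # `distance` positions to the left of the current token.
        if token == word and len(window) == distance and window[0] == collocate:
            yield rec_id
        window.append(token)
```

<br><br>Since a file object iterates line by line, something like left_collocate_hits(open("corpus.txt", encoding="utf-8"), "a", "como", 3) would stream the corpus without loading it into memory (the file name here is only a placeholder). Memory use stays at `distance` words regardless of corpus size.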
<br><br><div><span class="gmail_quote">On 5/22/07, <b class="gmail_sendername">Tony Berber Sardinha</b> &lt;<a href="mailto:tony@corpuslg.org">tony@corpuslg.org</a>&gt; wrote:</span><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
Hi everyone<br><br>I wonder if anyone has suggestions of ways of loading a large corpus<br>(about 200 million words) in a SQL table and then searching this<br>table for words and their collocates?<br><br>The corpus is currently in text format, and looks like this (columns
<br>separated by '|' ):<br><br>| 11961 | Revogam | NN | &lt;unknown&gt; |<br>| 11962 | - | : | - |<br>| 11963 | se | FW | se |<br>| 11964 | as | IN | as |
<br>| 11965 | leis | NNS | &lt;unknown&gt; |<br><br>where column 1 is the record id, column 2 is word, column 3 is tag,<br>and column 4 is lemma.<br><br>I could use a simple table structure like the one below:
<br><br>+-------+-------------+------+-----+---------+----------------+<br>| Field | Type | Null | Key | Default | Extra |<br>+-------+-------------+------+-----+---------+----------------+<br>| ID | int(11) | NO | PRI | NULL | auto_increment |
<br>| word | varchar(20) | YES | | NULL | |<br>| tag | varchar(20) | YES | | NULL | |<br>| lemma | varchar(20) | YES | | NULL | |<br>+-------+-------------+------+-----+---------+----------------+
<br><br>but I'm finding it hard to figure out how to search for collocates of<br>'word' in this table structure (for example where word = "a" and<br>third collocate to the left = "como").<br>
<br>Any ideas would be greatly appreciated.<br><br>bye<br><br>tony<br><br><br><br><br></blockquote></div><br><br clear="all"><br>-- <br>William Gregory Sakas<br>Associate Professor of Computer Science and Linguistics<br>Hunter College and the Graduate Center
<br>City University of New York (CUNY)<br>Email: <a href="mailto:sakas@hunter.cuny.edu">sakas@hunter.cuny.edu</a><br>Voice: 1 212 772.5211<br>Fax: 1 212 772.5219