[Corpora-List] Corpora and SQL

Tue May 22 18:55:57 UTC 2007

John D. Burger wrote:
>> Is this not a performance nightmare? A table with 200 million entries?
> 
> 
> Many databases are routinely used for far larger datasets.
> 
> With respect to the original query, the industrial-strength DB Postgres
> has a well-developed extension for text search called tsearch2:
> 
> http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/

Can you use it to query columnar data, as in the example in the original 
posting?

> One virtue of using real databases rather than text retrieval engines
> is the ability to query both document content and whatever metadata one
> might have associated with the text. "Find me blog entries with these
> words posted on Saturday evenings by authors whose profile says they
> were born before 1964 and are interested in sushi."

Another possibility is to store metadata in a SQL database, and export, 
on the fly, a subcorpus definition (start and stop positions) for CWB. 
The best of both worlds, so to speak. This works very well for the 
Glossa corpus query system (which has a combination of CWB and MySQL as 
a backend).
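
A minimal sketch of the idea in Python, using the standard sqlite3 module in place of MySQL so that it runs as-is. The schema (tables `authors` and `texts`, columns `birth_year`, `interests`, `start_pos`, `end_pos`) and the helper names are hypothetical, made up for illustration; this is not Glossa's actual schema or code. The tab-separated start/stop output is likewise an assumption, and the exact format CWB expects should be checked against its documentation.

```python
import sqlite3


def subcorpus_ranges(conn, born_before, interest):
    """Return (start, end) corpus positions of texts whose author matches.

    The metadata filter runs entirely in SQL; only positions are returned,
    so the corpus engine never needs to see the metadata.
    """
    rows = conn.execute(
        """
        SELECT t.start_pos, t.end_pos
        FROM texts AS t
        JOIN authors AS a ON a.id = t.author_id
        WHERE a.birth_year < ? AND a.interests LIKE ?
        ORDER BY t.start_pos
        """,
        (born_before, f"%{interest}%"),
    ).fetchall()
    return [(start, end) for start, end in rows]


def format_ranges(ranges):
    """Render the ranges as one 'start<TAB>end' line per text (assumed format)."""
    return "".join(f"{start}\t{end}\n" for start, end in ranges)


# Toy metadata database (hypothetical schema and data).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE authors (id INTEGER PRIMARY KEY, birth_year INTEGER, interests TEXT);
    CREATE TABLE texts (id INTEGER PRIMARY KEY, author_id INTEGER,
                        start_pos INTEGER, end_pos INTEGER);
    INSERT INTO authors VALUES (1, 1958, 'sushi, sailing'), (2, 1975, 'sushi'),
                               (3, 1960, 'chess');
    INSERT INTO texts VALUES (1, 1, 0, 499), (2, 2, 500, 999),
                             (3, 3, 1000, 1499), (4, 1, 1500, 1999);
    """
)

# Authors born before 1964 who list sushi as an interest.
ranges = subcorpus_ranges(conn, 1964, "sushi")
print(format_ranges(ranges), end="")
```

The resulting list of positions would then be handed to the corpus engine as the subcorpus to search, so the word-level query itself stays in CWB.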

-lars