[Corpora-List] Corpora and SQL

Adam Kilgarriff adam at lexmasterclass.com
Thu May 24 15:46:19 UTC 2007


Of course a dedicated corpus query tool does everything well without extra
engineering.  When I see a discussion like this with lots of comments like
"with a bit of effort", I think "how many person-hours do they mean? (And,
how good a solution will it be?) Unless person-hours are very cheap, it will
cost less to buy a service that already does what is wanted."  But, seeing
as I have such a service to sell, I'd better stop there or I shall be thrown
off the list for being commercial.

Adam
http://www.kilgarriff.co.uk 

-----Original Message-----
From: owner-corpora at lists.uib.no [mailto:owner-corpora at lists.uib.no] On
Behalf Of Lars Nygaard
Sent: 23 May 2007 14:30
To: corpora at uib.no
Subject: Re: [Corpora-List] Corpora and SQL



John D. Burger wrote:

>> Another possibility is to store metadata in a SQL database, and
>> export, on the fly, a subcorpus definition (start and stop positions)
>> for CWB. The best of both worlds, so to speak. This works very well
>> for the Glossa corpus query system (which has a combination of CWB
>> and MySQL as a backend).
> 
> 
> The disadvantage of this is that a single engine cannot reason about
> the best way to run your query. Like other databases, Postgresql keeps
> various summary statistics about the distribution of values in each
> indexed column, and uses these to construct a (hopefully) optimal query
> plan.
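
A minimal sketch of the hybrid setup quoted above: metadata lives in an SQL
table, and a query over it yields the start and stop positions that define a
subcorpus. Everything here is an assumption made for illustration: the
`texts` table, its columns, the sample rows, and the tab-separated output
layout are hypothetical, SQLite stands in for MySQL, and the step that hands
the ranges to CWB is not shown.

```python
import sqlite3

# Hypothetical metadata table: one row per text, with the corpus
# positions (token offsets) where that text starts and ends.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE texts ("
    " id INTEGER PRIMARY KEY, genre TEXT, year INTEGER,"
    " start_pos INTEGER, end_pos INTEGER)"
)
conn.executemany(
    "INSERT INTO texts (genre, year, start_pos, end_pos) VALUES (?, ?, ?, ?)",
    [
        ("news", 1999, 0, 999),
        ("fiction", 2001, 1000, 2499),
        ("news", 2003, 2500, 3199),
    ],
)


def subcorpus_ranges(conn, genre):
    """Return (start, end) position pairs for every text of one genre."""
    rows = conn.execute(
        "SELECT start_pos, end_pos FROM texts WHERE genre = ? ORDER BY start_pos",
        (genre,),
    )
    return [(start, end) for start, end in rows]


def format_ranges(ranges):
    """Serialise ranges as one 'start<TAB>end' line per region (assumed layout)."""
    return "".join(f"{start}\t{end}\n" for start, end in ranges)


ranges = subcorpus_ranges(conn, "news")
dump = format_ranges(ranges)
print(dump, end="")
```

The SQL engine only ever sees the metadata here; the token-level search would
still run in the corpus engine, restricted to the exported ranges.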

Yes, indeed. For single-word queries you could, with a bit of effort,
probably outperform CWB with an SQL-based system. It gets quite
unpredictable for more complex queries, however, and I suspect the
advantages of a single engine can easily be drowned out (cf. examples here:
http://www.ims.uni-stuttgart.de/projekte/CorpusWorkbench/CQPTutorial/html/node13.html).
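
To make the contrast concrete, here is a rough sketch of both kinds of query
against a purely SQL token store. It assumes a hypothetical `tokens` table
with one row per corpus position and uses SQLite for convenience; the table,
index, and helper names are invented for illustration, and the sketch says
nothing about how such a system would actually perform next to CWB.

```python
import sqlite3

# Hypothetical token table: one row per corpus position.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tokens (pos INTEGER PRIMARY KEY, word TEXT NOT NULL)")
conn.execute("CREATE INDEX idx_tokens_word ON tokens (word)")

text = "the cat sat on the mat and the cat slept".split()
conn.executemany("INSERT INTO tokens (pos, word) VALUES (?, ?)", enumerate(text))


def find_word(conn, word):
    """Single-word query: one lookup on the word index."""
    rows = conn.execute(
        "SELECT pos FROM tokens WHERE word = ? ORDER BY pos", (word,)
    )
    return [row[0] for row in rows]


def find_sequence(conn, words):
    """Word-sequence query: one self-join per additional token."""
    joins = ["tokens AS t0"]
    conds = ["t0.word = ?"]
    for i in range(1, len(words)):
        joins.append(f"JOIN tokens AS t{i} ON t{i}.pos = t0.pos + {i}")
        conds.append(f"t{i}.word = ?")
    sql = (
        f"SELECT t0.pos FROM {' '.join(joins)} "
        f"WHERE {' AND '.join(conds)} ORDER BY t0.pos"
    )
    return [row[0] for row in conn.execute(sql, words)]


print(find_word(conn, "the"))
print(find_sequence(conn, ["the", "cat"]))
```

The single-word lookup is a plain index scan, while every extra token in a
sequence adds another join, and that is before optional tokens, regular
expressions over words, or structural constraints enter the picture.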


It would certainly be interesting, though, if anyone were up to the
challenge of implementing the full range of features found in IMS CQP 
with an SQL backend.

best,
lars nygaard


