[Corpora-List] Corpora and SQL

Thu May 24 12:00:21 UTC 2007

>> Yes, indeed. For single-word queries you could, with a bit of effort,
>> probably outperform CWB with an SQL-based system. It gets quite
>> unpredictable for more complex queries, however, and I suspect the
>> advantages of a single engine can easily be drowned.

I use a purely SQL approach with (for example) the VIEW interface to the
BNC (http://view.byu.edu) or the 100 million word TIME corpus
(http://view.byu.edu/timemag), and it seems to handle "complex" queries
quite well -- less than two or three seconds for a query like "white
[nn*] that". I've used CWB, but it doesn't seem to be any faster for a
query like this on a large corpus (e.g. 100+ million words) -- should it
be? In addition, a true SQL approach allows nice functionality in terms
of limiting and comparing by sub-corpora (directly, as part of the
query; see the help files and examples at these two websites).
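
To make the idea concrete, here is a minimal sketch of how a query like
"white [nn*] that" could be expressed in SQL, with the sub-corpus limit
as part of the same query. It is not the actual VIEW schema: the table
"tokens", its columns (pos_id, word, tag, genre), the two function names
and the toy data are all assumptions for illustration, and it uses
SQLite from Python only so that it is self-contained.

```python
import sqlite3


def build_demo_corpus():
    """Build a tiny in-memory corpus: one row per token (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE tokens ("
        " pos_id INTEGER PRIMARY KEY,"  # running token position in the corpus
        " word TEXT, tag TEXT, genre TEXT)"
    )
    rows = [
        (1, "the", "at0", "news"), (2, "white", "aj0", "news"),
        (3, "house", "nn1", "news"), (4, "that", "cjt", "news"),
        (5, "a", "at0", "fiction"), (6, "white", "aj0", "fiction"),
        (7, "horse", "nn1", "fiction"), (8, "that", "cjt", "fiction"),
        (9, "white", "aj0", "news"), (10, "house", "nn1", "news"),
        (11, "that", "cjt", "news"), (12, "white", "aj0", "fiction"),
        (13, "and", "cjc", "fiction"), (14, "that", "dt0", "fiction"),
    ]
    conn.executemany("INSERT INTO tokens VALUES (?, ?, ?, ?)", rows)
    conn.execute("CREATE INDEX idx_tokens_word ON tokens (word)")
    return conn


def white_noun_that(conn, genre=None):
    """Frequency list for 'white [nn*] that', optionally limited to one genre."""
    # Self-join on consecutive positions: t1 = white, t2 = any noun, t3 = that.
    sql = (
        "SELECT t2.word, COUNT(*) AS freq"
        " FROM tokens t1"
        " JOIN tokens t2 ON t2.pos_id = t1.pos_id + 1"
        " JOIN tokens t3 ON t3.pos_id = t1.pos_id + 2"
        " WHERE t1.word = 'white' AND t2.tag LIKE 'nn%' AND t3.word = 'that'"
    )
    params = []
    if genre is not None:
        # The sub-corpus restriction is just another condition in the query.
        sql += " AND t1.genre = ?"
        params.append(genre)
    sql += " GROUP BY t2.word ORDER BY freq DESC, t2.word"
    return conn.execute(sql, params).fetchall()
```

On the toy data, white_noun_that(build_demo_corpus()) gives
[('house', 2), ('horse', 1)], and passing genre="fiction" gives
[('horse', 1)]. A real corpus would also need to keep matches from
crossing text boundaries, which this sketch ignores.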

>> It would certainly be interesting, though, if anyone were up to the
>> challenge of implementing the full range of features found in IMS CQP
>> with an SQL backend.

Each approach has its advantages and disadvantages. Just as a purely SQL
approach may not do everything that CWB can, I'm sure the converse is
also true.

Mark Davies

============================================
Mark Davies
Professor of (Corpus) Linguistics
Brigham Young University
(phone) 801-422-9168 / (fax) 801-422-0906
Web: davies-linguistics.byu.edu

** Corpus design and use // Linguistic databases **
** Historical linguistics // Language variation **
** English, Spanish, and Portuguese **
============================================