[Corpora-List] Corpora and SQL

Mark Davies Mark_Davies at byu.edu
Thu May 24 17:46:14 UTC 2007


>> As someone else has noted, these sorts of queries are notoriously
>> slow in a relational database system. As a general rule of thumb,
>> when 'linear' information about rows (e.g., distance between rows)
>> is needed to perform a calculation,

The operative term being "general rule of thumb". As I've mentioned
before, you can have fast queries (anywhere from a small fraction of a
second to a handful of seconds, depending on the string) on a 100+
million word corpus. Just don't use self-joins, which is typically what
people try when they first get into SQL. That would be slow, but there
are many other approaches that are hundreds of times as fast: multiple
clustered indexes, temp tables (with or without clustered indexes), etc.
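
To make the contrast concrete, here is a minimal sketch of the two query
shapes for a simple adjacent-word search. It is only an illustration under
stated assumptions, not the schema or queries behind any real corpus: it
uses SQLite through Python's sqlite3 module, and the table `tokens`, the
columns `pos` and `word`, the temp table `hits`, and both function names
are hypothetical. SQLite has no CLUSTERED keyword; an INTEGER PRIMARY KEY
(rows stored in key order) stands in for a clustered index here. The toy
data shows that the two forms return the same hits; it says nothing about
timings.

```python
import sqlite3

# Toy corpus: one row per token, keyed by its position in the text.
TEXT = "she likes strong tea but he likes strong coffee and weak tea"

conn = sqlite3.connect(":memory:")
# INTEGER PRIMARY KEY: SQLite stores the rows in key order (assumed
# stand-in for a clustered index on position).
conn.execute("CREATE TABLE tokens (pos INTEGER PRIMARY KEY, word TEXT NOT NULL)")
conn.executemany(
    "INSERT INTO tokens (pos, word) VALUES (?, ?)",
    enumerate(TEXT.split()),
)
# Secondary index so that lookups by word do not scan the whole table.
conn.execute("CREATE INDEX idx_tokens_word ON tokens (word, pos)")


def bigram_self_join(first, second):
    """Naive form: join the full tokens table to itself."""
    rows = conn.execute(
        "SELECT a.pos FROM tokens AS a "
        "JOIN tokens AS b ON b.pos = a.pos + 1 "
        "WHERE a.word = ? AND b.word = ? ORDER BY a.pos",
        (first, second),
    )
    return [pos for (pos,) in rows]


def bigram_temp_table(first, second):
    """Temp-table form: collect candidate positions first, then probe."""
    conn.execute("DROP TABLE IF EXISTS temp.hits")
    # Keyed temp table holding the position right after each `first` hit.
    conn.execute("CREATE TEMP TABLE hits (next_pos INTEGER PRIMARY KEY)")
    conn.execute(
        "INSERT INTO hits SELECT pos + 1 FROM tokens WHERE word = ?",
        (first,),
    )
    # Probe the main table by primary key only at the candidate positions.
    rows = conn.execute(
        "SELECT h.next_pos - 1 FROM hits AS h "
        "JOIN tokens AS t ON t.pos = h.next_pos "
        "WHERE t.word = ? ORDER BY h.next_pos",
        (second,),
    )
    return [pos for (pos,) in rows]
```

For example, bigram_temp_table("strong", "tea") and
bigram_self_join("strong", "tea") return the same list of starting
positions. Whether the temp-table form is actually faster depends on the
database engine, the indexes, and the frequency of the words involved.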
 
>> SQL is not the way to go; especially if the calculation can be done
>> on one pass through a file.

If SQL works as fast as or faster than a linear pass through a file, why
not use it?

Mark Davies

============================================
Mark Davies
Professor of (Corpus) Linguistics
Brigham Young University
(phone) 801-422-9168 / (fax) 801-422-0906
Web: davies-linguistics.byu.edu

** Corpus design and use // Linguistic databases **
** Historical linguistics // Language variation **
** English, Spanish, and Portuguese **
============================================ 
 


