[Corpora-List] Corpora and SQL

Mark Davies Mark_Davies at byu.edu
Wed May 23 07:45:32 UTC 2007


Tony,

>> I wonder if anyone has suggestions of ways of loading a large corpus
>> (about 200 million words) in a SQL table and then searching this
>> table for words and their collocates?

I use a SQL-only solution for several large corpora that I've placed
online, including:

BNC (VIEW) (100m words) http://view.byu.edu
TIME corpus (new May 2007; 100m words; US 1900s)
http://view.byu.edu/timemag
Corpus del Español (100m words) http://www.corpusdelespanol.org
Corpus do Português (45m words) http://www.corpusdoportugues.org

I do basic KWIC searches, collocates (e.g. 10 words left/right of the
node word), n-gram searches, etc. with the architecture I've developed.
One of the nice things about the SQL approach is that you can also
limit and compare by frequency in sub-sections of the corpus -- by time
period or by register, for example. Finally, nearly all of the queries
are quite fast -- 1-2 seconds for most, even on a 100m-word corpus.
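As a rough illustration, a collocate search over a one-row-per-token
table might look like the sketch below. The table and column names
(corpus, offset, word) are made up for the example, not the exact
schema I use:

    -- Hypothetical schema: corpus(offset, word), one row per token,
    -- where offset is the token's sequential position in the corpus.
    -- Find collocates within 10 words left/right of the node "run".
    -- (A real schema would also bound the join by text ID so spans
    -- don't cross document boundaries.)
    SELECT c2.word, COUNT(*) AS freq
    FROM corpus c1
    JOIN corpus c2
      ON c2.offset BETWEEN c1.offset - 10 AND c1.offset + 10
     AND c2.offset <> c1.offset
    WHERE c1.word = 'run'
    GROUP BY c2.word
    ORDER BY freq DESC;

With a clustered index on the offset column, the range join above
becomes a series of short sequential scans rather than scattered
random lookups.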

The architecture relies on a number of different tables -- one with a
clustered index on the ID/offset column, another with a clustered index
on the middle word of a 5-gram table, plus temp tables. In addition,
these tables can be linked to others like WordNet, user-defined
wordlists, basic frequency indices, etc. All of this gives a fairly
robust (and fast) architecture, I think.
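A bare-bones version of that kind of layout might look like the
following (again, illustrative names only, with SQL Server-style
syntax for the clustered indexes):

    -- One row per token; clustering on offset makes KWIC context
    -- retrieval a fast range scan around each hit.
    CREATE TABLE corpus (
        offset INT NOT NULL,
        word   VARCHAR(50) NOT NULL
    );
    CREATE CLUSTERED INDEX ix_corpus_offset ON corpus (offset);

    -- One row per distinct 5-gram with its frequency; clustering on
    -- the middle word keeps all contexts of a node word physically
    -- contiguous, so collocate queries touch few disk pages.
    CREATE TABLE grams5 (
        w1 VARCHAR(50) NOT NULL, w2 VARCHAR(50) NOT NULL,
        w3 VARCHAR(50) NOT NULL, w4 VARCHAR(50) NOT NULL,
        w5 VARCHAR(50) NOT NULL, freq INT NOT NULL
    );
    CREATE CLUSTERED INDEX ix_grams5_mid ON grams5 (w3);

Linking grams5 to a frequency index or a user-defined wordlist is then
just an ordinary equality join on the word columns.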

I hope this helps.

Mark Davies

============================================
Mark Davies
Professor of (Corpus) Linguistics
Brigham Young University
(phone) 801-422-9168 / (fax) 801-422-0906
Web: davies-linguistics.byu.edu

** Corpus design and use // Linguistic databases **
** Historical linguistics // Language variation **
** English, Spanish, and Portuguese **
============================================


