[Corpora-List] Corpora and SQL

Wed May 23 08:09:16 UTC 2007

>> We built a 250-million-word corpus in Spanish, implemented with MS SQL
Server 2000. We use the text retrieval engine from Microsoft (called "Text
Services" in SQL Server) that is very fast. The full database (data and
indexes) occupies 2 GB.

I use MS SQL Server as well. The downside of using "(Full-)Text Services" in SQL Server is that you're only searching raw text -- exact words and phrases, with very limited substring support (e.g. no leading wildcards). With Text Services alone, there is really no way to query a corpus that has been annotated (for part of speech or lemma), e.g.:

	[pn_obj] querer.* [v_inf]	lo quiero hacer, nos querían hablar, etc

That's why I use real SQL, rather than Text Services, for most queries. It's still fast and it still handles large corpora, but you can also query real linguistic annotation on the corpus.
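
To make the idea concrete, here is a minimal sketch of how a query like the one above might look against an annotated corpus. It is not the actual schema or SQL used for these corpora: it assumes a hypothetical one-row-per-token table (`tokens`, with columns `pos_id`, `word`, `lemma`, `tag`), uses SQLite through Python as a stand-in for SQL Server, and reads `querer.*` as "any form of the lemma querer". The tags `v_fin`, `conj` and `n` in the sample data are invented for illustration.

```python
import sqlite3

# Hypothetical sample data: (word, lemma, tag), one tuple per token.
SAMPLE = [
    ("lo", "lo", "pn_obj"), ("quiero", "querer", "v_fin"), ("hacer", "hacer", "v_inf"),
    ("pero", "pero", "conj"),
    ("nos", "nos", "pn_obj"), ("querían", "querer", "v_fin"), ("hablar", "hablar", "v_inf"),
    ("y", "y", "conj"),
    ("quiero", "querer", "v_fin"), ("pan", "pan", "n"),
]


def build_corpus(tokens):
    """Load tokens into an in-memory table, one row per token (assumed layout)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE tokens ("
        "pos_id INTEGER PRIMARY KEY, word TEXT, lemma TEXT, tag TEXT)"
    )
    conn.executemany(
        "INSERT INTO tokens (pos_id, word, lemma, tag) VALUES (?, ?, ?, ?)",
        [(i, w, l, t) for i, (w, l, t) in enumerate(tokens, start=1)],
    )
    # An index on the lemma column keeps the lookup of the anchor word cheap.
    conn.execute("CREATE INDEX idx_tokens_lemma ON tokens (lemma)")
    return conn


def find_pattern(conn):
    """Find [pn_obj] + any form of 'querer' + [v_inf] in adjacent positions."""
    rows = conn.execute(
        """
        SELECT t1.word, t2.word, t3.word
        FROM tokens AS t2
        JOIN tokens AS t1 ON t1.pos_id = t2.pos_id - 1
        JOIN tokens AS t3 ON t3.pos_id = t2.pos_id + 1
        WHERE t2.lemma = 'querer'
          AND t1.tag = 'pn_obj'
          AND t3.tag = 'v_inf'
        ORDER BY t2.pos_id
        """
    ).fetchall()
    return [" ".join(row) for row in rows]


if __name__ == "__main__":
    for hit in find_pattern(build_corpus(SAMPLE)):
        print(hit)
```

The self-join on consecutive `pos_id` values is what lets each slot of the pattern be constrained by word form, lemma or tag independently. This sketch ignores sentence boundaries, which a real corpus table would need to respect.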

Mark Davies

============================================
Mark Davies
Professor of (Corpus) Linguistics
Brigham Young University
(phone) 801-422-9168 / (fax) 801-422-0906
Web: davies-linguistics.byu.edu

** Corpus design and use // Linguistic databases **
** Historical linguistics // Language variation **
** English, Spanish, and Portuguese **
============================================