Corpora: SQL Server as an option for large, fast web-based corpora

Mon Dec 4 19:30:20 UTC 2000

For the last three or four years I've been looking for a program to allow
me to provide access to large (100+ million word) corpora via the Web, and
provide users with fast (<10 sec) queries and KWIC-formatted
output.  Virtually all of the commercially available "web-indexing"
packages are designed to return a list of web pages with the desired
content, but they provide no way to extract and list just the relevant
sections of each web page in KWIC format.  I am also aware of several
PC-based solutions that are designed to provide KWIC-format output, but
these are for use on a local workstation, and have not yet been
(completely) modified to provide Web access.

Within the last three or four months I've developed a schema that does
allow access to corpora via the Web, and this approach involves:
   -- SQL Server (including the new "full-text searching" option in 7.0)
   -- ADO (ActiveX Data Objects), and
   -- ASP scripts (Active Server Pages, using VBScript).
This schema allows fairly fast access (<10 seconds for most queries) to
large corpora (~180-200 million words).  The output is displayed in KWIC
format, and the results can further be sorted by left or right context words.
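To make the KWIC output and the left/right sorting concrete, here is a minimal sketch in Python rather than the VBScript used on the actual site. It is only an illustration: the function names (kwic, sort_by_left, sort_by_right, format_row) are made up for this example, and it works on a plain string in memory instead of rows returned from SQL Server.

```python
import re

def kwic(text, keyword, width=3):
    """Return (left, hit, right) tuples for each occurrence of keyword.

    left and right are lists of up to `width` context words.
    """
    tokens = re.findall(r"\w+", text.lower())
    target = keyword.lower()
    rows = []
    for i, tok in enumerate(tokens):
        if tok == target:
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            rows.append((left, tok, right))
    return rows

def sort_by_right(rows):
    # Sort on the first word to the right of the hit, then the second, etc.
    return sorted(rows, key=lambda r: r[2])

def sort_by_left(rows):
    # Sort on the word nearest the hit first, so reverse the left context.
    return sorted(rows, key=lambda r: r[0][::-1])

def format_row(row, col=30):
    # Right-align the left context so the hits line up in one column.
    left, hit, right = row
    return f"{' '.join(left):>{col}}  [{hit}]  {' '.join(right)}"

if __name__ == "__main__":
    sample = "a big house fell and one red house burned"
    for row in sort_by_right(kwic(sample, "house", width=2)):
        print(format_row(row))
```

Sorting the left context in reverse order is a design choice: it groups lines by the word immediately before the hit, which is usually what one wants when looking at collocations.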

Examples of these corpora can be found at:

http://mdavies.for.ilstu.edu/corpus
    3 million word corpus of historical Spanish
http://mdavies.for.ilstu.edu/corpus/publico
    180+ million word corpus of Modern Portuguese

The one major shortcoming of this approach is that it is limited by the
(overly restrictive) native search syntax of the "full-text" search engine
in SQL Server, which serves as the backbone for the corpora.  While it is
possible to do wildcard OR proximity searches (e.g. 1-3 intervening words),
it is not possible to combine these two types of queries.  In addition, it
is not possible to do left-branching wildcard queries (*ing, *ization,
etc.).  Using script-based serial queries (which would be invisible to the
end user), however, it should be possible to replicate most of these more
advanced queries.
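One way such serial queries might work is sketched below in Python. This is an assumption about how the idea could be implemented, not a description of what the site currently does. Step 1 expands a pattern like *ando against a word list, so that each concrete form can then be sent to the engine as an ordinary query. Step 2 takes the passages that come back from a broad first query and filters them in script for a wildcard plus proximity condition. The names (expand_suffix, wildcard_to_regex, within_distance) are hypothetical, and plain Python sets and strings stand in for the word list and for the results that would really come from SQL Server via ADO.

```python
import re

def expand_suffix(pattern, vocabulary):
    """Step 1: turn a leading-wildcard pattern such as '*ing' into the
    concrete word forms found in a word list."""
    suffix = pattern.lstrip("*")
    return sorted(w for w in vocabulary if w.endswith(suffix))

def wildcard_to_regex(pattern):
    """Translate a pattern with '*' wildcards into a regex for one word."""
    return "".join(r"\w*" if ch == "*" else re.escape(ch) for ch in pattern)

def within_distance(passage, first, second, max_gap=3):
    """Step 2: keep a passage only if a word matching `first` is followed
    by a word matching `second` with at most max_gap intervening words."""
    tokens = re.findall(r"\w+", passage.lower())
    a = re.compile(wildcard_to_regex(first))
    b = re.compile(wildcard_to_regex(second))
    for i, tok in enumerate(tokens):
        if a.fullmatch(tok):
            # Positions i+1 .. i+1+max_gap allow 0..max_gap words in between.
            window = tokens[i + 1:i + 2 + max_gap]
            if any(b.fullmatch(t) for t in window):
                return True
    return False

if __name__ == "__main__":
    words = {"falando", "falar", "cantando", "casa"}
    print(expand_suffix("*ando", words))
    passage = "ela estava falando muito com a casa"
    print(within_distance(passage, "fal*", "casa", max_gap=3))
```

The cost of this approach is that the first query may return many more passages than the user finally sees, so the script-side filtering has to stay cheap for response times to remain acceptable.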

In addition to a more robust search syntax, there are other improvements,
such as more options for output (number of context words and sorting), that
I could and probably should integrate into the corpora.  But for right now I
think they still provide some indication of what can be done.

At any rate, I'm sending this to CORPORA simply to get feedback from those
who are working on similar approaches for PC/NT-based web-accessible
corpora.  I'd appreciate any comments that you might have.

Mark Davies
Illinois State University

=======================================
Mark Davies, Associate Professor, Spanish Linguistics
Dept. of Foreign Languages, Illinois State University
Normal, IL 61790-4300

Voice:309/438-7975      email:mdavies at ilstu.edu
Fax:309/438-8038          http://mdavies.for.ilstu.edu/
=======================================