Corpora: Using a relational database to store conc pointers

Oliver Mason oliver at clg.bham.ac.uk
Fri Mar 31 08:47:20 UTC 2000


If one opts to implement one's own system instead of using a general-purpose
database, the definitive guide is

Witten, I., Moffat, A., Bell, T. (1994)
  Managing Gigabytes: Compressing and Indexing Documents and Images
  Van Nostrand Reinhold, New York.

Despite its technical topic, it is very readable, even for people without
a mathematical background.

<shameless plug>
The CUE system (available from the Birmingham Corpus Research Website, and
also through an application called QWICK on the BNC Sampler and the latest
ICAME CD ROM) is a Java implementation of algorithms described in that book.
It compresses the text as well as the index, which means that the fully
indexed corpus takes up less space than the uncompressed plain-text input
file.
</shameless plug>
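
To give a flavour of what index compression can look like, here is a minimal
sketch in Java.  It is not taken from CUE, and the class and method names
(GapCodec, encode, decode) are made up for this illustration.  It assumes the
index stores, for each word, an ascending list of token positions, and it
shows just one common technique: storing the gaps between successive
positions in a variable-length byte code rather than storing each position as
a full 4-byte integer.  The book covers several other codes as well.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: gap encoding with a variable-byte code.
final class GapCodec {

    // Encode ascending positions as gaps; each gap is written 7 bits at a
    // time, low bits first, with the high bit set when more bytes follow.
    static byte[] encode(int[] positions) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int previous = 0;
        for (int position : positions) {
            int gap = position - previous;
            if (gap < 0) {
                throw new IllegalArgumentException("positions must be ascending");
            }
            previous = position;
            while (gap >= 0x80) {
                out.write((gap & 0x7F) | 0x80);
                gap >>>= 7;
            }
            out.write(gap);
        }
        return out.toByteArray();
    }

    // Reverse of encode: rebuild each gap, then sum the gaps to recover
    // the original positions.
    static int[] decode(byte[] data) {
        List<Integer> positions = new ArrayList<>();
        int previous = 0;
        int gap = 0;
        int shift = 0;
        for (byte b : data) {
            gap |= (b & 0x7F) << shift;
            if ((b & 0x80) != 0) {
                shift += 7;            // continuation byte, keep reading
            } else {
                previous += gap;       // last byte of this gap
                positions.add(previous);
                gap = 0;
                shift = 0;
            }
        }
        int[] result = new int[positions.size()];
        for (int i = 0; i < result.length; i++) {
            result[i] = positions.get(i);
        }
        return result;
    }

    public static void main(String[] args) {
        int[] positions = {3, 7, 130, 100000};
        byte[] packed = encode(positions);
        System.out.println(packed.length + " bytes packed, "
                + (4 * positions.length) + " bytes as plain ints");
    }
}
```

Frequent words have small gaps between occurrences, so most of their entries
fit in a single byte, which is where the saving comes from.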

Oliver Christ pointed that book out to me about five years ago, and I believe
the Stuttgart corpus access system is also based on it, as he was working on
it at the time.

Oliver

--
//\\ computer officer | corpus research | department of english | school of  -
//\\ humanities | university of birmingham | edgbaston | birmingham b15 2tt  -
\\// united kingdom | phone +44-(0)121-414-6206 | fax +44-(0)121-414-5668/\  -
\\// mobile 07050 104504 | http://www.clg.bham.ac.uk | o.mason at bham.ac.uk\/  -


