Corpora: Summary: Relational databases and conc pointers
Mickel Grönroos
mcgronro at ling.helsinki.fi
Tue Apr 4 07:14:16 UTC 2000
Dear colleagues,
Thanks to Chris Brew, Alexander Clark, David Graff, Jochen Leidner, Oliver
Mason, Manne Miettinen, Mika Rissanen, Tylman Ule and Tom Vanallemeersch
(alphabetically listed) for useful tips and a fruitful discussion on how to
combine a relational database with a concordancer/collocator.
The suggestions varied somewhat, but mainly they can be divided into the
following two groups:
1 All information is stored in the database
2 Type-specific information is stored in the database, with
  pointers to a pointer list containing the information needed
  for token lookup
I'll try to explain the two approaches briefly:
1 The first approach makes extensive use of the database architecture. For
each token in the corpus you generate a row in a database table, e.g.
something like this:
| tokenId | typeId | file | byteOffset |
|---------+--------+------+------------|
|     120 |      1 |   12 |       1443 |
This says that the 120th occurrence of the word type numbered 1 is found in
file number 12, starting at byte position 1443.
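To make this concrete, here is a rough sketch in Python using SQLite as a
stand-in for whatever relational database you actually use. The table and
column names simply follow the example row above; loading the corpus and
building the types table are of course left out:

import sqlite3

conn = sqlite3.connect("corpus.db")
conn.execute(
    """CREATE TABLE IF NOT EXISTS tokens (
           tokenId    INTEGER,   -- running number of this occurrence of the type
           typeId     INTEGER,   -- id of the word type (from a separate types table)
           file       INTEGER,   -- id of the corpus file
           byteOffset INTEGER    -- byte position of the token in that file
       )"""
)
conn.execute("INSERT INTO tokens VALUES (?, ?, ?, ?)", (120, 1, 12, 1443))
conn.commit()

# Concordance lookup: fetch every occurrence of type 1, in corpus order.
for token_id, file_id, offset in conn.execute(
    "SELECT tokenId, file, byteOffset FROM tokens "
    "WHERE typeId = ? ORDER BY file, byteOffset", (1,)
):
    print(f"occurrence {token_id}: file {file_id}, byte {offset}")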
Seems rather straightforward, doesn't it? Well, it still raises the question
of whether it is sensible to store, say, 50,000 rows in a db table for just
one high-frequency word form (since each token in the corpus generates a row
of its own in the pointer table).
Databases are of course intended to handle tables with several million rows,
so technically this should be possible to implement, as long as the corpus
being indexed does not contain half a billion words or so ... But still, is
it sensible?
2 The second approach takes into consideration that it is a waste of db
space to store a separate record for each and every pointer to the corpus
files. Instead, the pointers are stored in a file outside the database. The
database then contains a table with pointers to this external index
instead, like this:
| idx | byteStart | byteLength |
+-----+-----------+------------+
|   1 |      1170 |        251 |
This says that the pointers needed for word type number 1 are found in the
index file starting at byte position 1170 and running 251 bytes onward. This
information is then used by the software to fetch the appropriate pointers
from the index and then the actual text from the corpus files.
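Here is a rough Python sketch of that lookup. The table name (typeIndex) and
the layout of the external index file are assumptions made purely for
illustration: I pretend each pointer is stored as two 4-byte unsigned
integers (file id, byte offset), but any fixed record format would do:

import sqlite3
import struct

POINTER_FORMAT = "<II"                    # assumed: (file id, byte offset)
POINTER_SIZE = struct.calcsize(POINTER_FORMAT)

def pointers_for_type(conn, index_path, type_id):
    """Return the (file, byteOffset) pointers for one word type."""
    row = conn.execute(
        "SELECT byteStart, byteLength FROM typeIndex WHERE idx = ?", (type_id,)
    ).fetchone()
    if row is None:
        return []
    start, length = row
    with open(index_path, "rb") as index_file:
        index_file.seek(start)            # jump to this type's block of pointers
        block = index_file.read(length)
    return [struct.unpack_from(POINTER_FORMAT, block, i)
            for i in range(0, len(block), POINTER_SIZE)]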
The elegance of this approach is that more or less identical information does
not have to be stored in the database and, above all, that the index file can
be compressed. Without compression the index file is likely to be almost the
same size as the corpus itself (as every token generates a pointer of its
own). With compression the index file will shrink considerably.
I don't have any experience with index compression myself, but this
possibility was raised by Chris Brew, Oliver Mason and Tom Vanallemeersch,
and it seems rational. See Witten, Moffat & Bell (1994), "Managing Gigabytes:
Compressing and Indexing Documents and Images", or Baeza-Yates & Ribeiro-Neto,
"Modern Information Retrieval" (pp. 184 ff.) for more information.
Thank you for reading.
Cheers,
Mickel Grönroos
University of Helsinki
www.ling.helsinki.fi/~mcgronro/ | Mickel.Gronroos at helsinki.fi
---------------------------------|----------------------------
Inst. för allmän språkvetenskap | Dep. of General Linguistics
PB 4 (Fabiansgatan 28) | tfn/phone +358-9-191 22707
FI-00014 Helsingfors universitet | fax +358-9-191 23598