Corpora: Summary: Relational databases and conc pointers
Mickel Grönroos
mcgronro at ling.helsinki.fi
Tue Apr 4 07:14:16 UTC 2000
Dear colleagues,
Thanks to Chris Brew, Alexander Clark, David Graff, Jochen Leidner, Oliver
Mason, Manne Miettinen, Mika Rissanen, Tylman Ule and Tom Vanallemeersch
(alphabetically listed) for useful tips and a fruitful discussion on how to
combine a relational database with a concordancer/collocator.
The suggestions varied somewhat, but mainly they can be divided into the
following two groups:
1 All information is stored in the database
2 Type-specific information is stored in the database, with
  pointers to a pointer list containing the information needed
  for token lookup
I'll try to explain the two approaches briefly:
1 The first approach makes extensive use of the database architecture. For
each token in the corpus you generate a row in a database table, e.g.
something like this:
| tokenId | typeId | file | byteOffset |
|---------+--------+------+------------|
|     120 |      1 |   12 |       1443 |
This says that the 120th occurrence of the word type numbered 1 is found in
file number 12, starting at byte position 1443.
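To make this concrete, here is a rough sketch in Python using SQLite as a
stand-in for whatever relational database you actually use. The table and
column names simply follow the example row above; loading the corpus and
building the types table are of course left out:

import sqlite3

conn = sqlite3.connect("corpus.db")
conn.execute(
    """CREATE TABLE IF NOT EXISTS tokens (
           tokenId    INTEGER,   -- running number of this occurrence of the type
           typeId     INTEGER,   -- id of the word type (from a separate types table)
           file       INTEGER,   -- id of the corpus file
           byteOffset INTEGER    -- byte position of the token in that file
       )"""
)
conn.execute("INSERT INTO tokens VALUES (?, ?, ?, ?)", (120, 1, 12, 1443))
conn.commit()

# Concordance lookup: fetch every occurrence of type 1, in corpus order.
for token_id, file_id, offset in conn.execute(
    "SELECT tokenId, file, byteOffset FROM tokens "
    "WHERE typeId = ? ORDER BY file, byteOffset", (1,)
):
    print(f"occurrence {token_id}: file {file_id}, byte {offset}")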
Seems rather straightforward, doesn't it? Well, it still raises the question
of whether it is sensible to store, say, 50,000 rows in a db table for just
one high-frequency word form (since each token in the corpus generates a row
of its own in the pointer table).
Databases are of course intended to handle tables with several million rows,
so technically this should be possible to implement, as long as the corpus
being indexed does not contain half a billion words or so ... But still, is
it sensible?
2 The second approach takes into consideration that it is a waste of db
space to store a separate record for each and every pointer to the corpus
files. Instead, the pointers are stored in a file outside the database. The
database then contains a table with pointers to this external index
instead, like this:
| idx | byteStart | byteLength |
+-----+-----------+------------+
|   1 |      1170 |        251 |
This says that the pointers needed for word type number 1 are found in the
index file starting at byte position 1170 and running 251 bytes onward. This
information is then used by the software to fetch the appropriate pointers
from the index and then the actual text from the corpus files.
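Here is a rough Python sketch of that lookup. The table name (typeIndex) and
the layout of the external index file are assumptions made purely for
illustration: I pretend each pointer is stored as two 4-byte unsigned
integers (file id, byte offset), but any fixed record format would do:

import sqlite3
import struct

POINTER_FORMAT = "<II"                    # assumed: (file id, byte offset)
POINTER_SIZE = struct.calcsize(POINTER_FORMAT)

def pointers_for_type(conn, index_path, type_id):
    """Return the (file, byteOffset) pointers for one word type."""
    row = conn.execute(
        "SELECT byteStart, byteLength FROM typeIndex WHERE idx = ?", (type_id,)
    ).fetchone()
    if row is None:
        return []
    start, length = row
    with open(index_path, "rb") as index_file:
        index_file.seek(start)            # jump to this type's block of pointers
        block = index_file.read(length)
    return [struct.unpack_from(POINTER_FORMAT, block, i)
            for i in range(0, len(block), POINTER_SIZE)]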
The elegance of this approach is that more or less identical information does
not have to be stored in the database and, above all, that the index file can
be compressed. Without compression the index file is likely to be almost the
same size as the corpus itself (as every token generates a pointer of its
own). With compression the index file will shrink considerably.
I don't have any experience with index compression myself, but this
possibility was raised by Chris Brew, Oliver Mason and Tom Vanallemeersch,
and it seems rational. See Witten, Moffat & Bell (1994), "Managing Gigabytes:
Compressing and Indexing Documents and Images", or Baeza-Yates & Ribeiro-Neto,
"Modern Information Retrieval" (pp. 184 ff.) for more information.
Thank you for reading.
Cheers,
Mickel Grönroos
University of Helsinki
www.ling.helsinki.fi/~mcgronro/ | Mickel.Gronroos at helsinki.fi
---------------------------------|----------------------------
Inst. för allmän språkvetenskap | Dep. of General Linguistics
PB 4 (Fabiansgatan 28) | tfn/phone +358-9-191 22707
FI-00014 Helsingfors universitet | fax +358-9-191 23598