Corpora: Using a relational database to store conc pointers

Thu Mar 30 07:37:39 UTC 2000

Dear colleagues,

Does anybody have any experience of using a relational database to store 
index information for a concordance service?

I'm building a test interface for the Bank of Finnish and plan to store 
pointers to specific locations in the corpus in a database column. For 
example, 344:2555 would point to corpus file number 344, byte position 
2555.
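
A minimal sketch of how such a pointer could be resolved, assuming the "file:byte" format above. The function names, the idea of passing the file path in directly, and the context width are illustrative assumptions, not part of any existing system:

```python
def parse_pointer(pointer):
    """Split a 'file:offset' pointer such as '344:2555' into two integers."""
    file_no, byte_pos = pointer.split(":")
    return int(file_no), int(byte_pos)


def read_context(path, byte_pos, width=40):
    """Return up to `width` raw bytes starting at `byte_pos` in the file.

    `path` is assumed to be the corpus file that the pointer's file
    number maps to; how that mapping is done is left open here.
    """
    with open(path, "rb") as f:
        f.seek(byte_pos)
        return f.read(width)
```

Reading in binary mode keeps the byte position exact even when the corpus text contains multi-byte characters.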

The obvious problem is how to handle common words, since every 
occurrence of a given word type needs a pointer of its own. If a common 
word has a frequency of, say, 50,000, it will generate 50,000 pointers 
as well. Putting all of these into a single field of one row seems 
rather foolish. Does anybody know how to avoid this?
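
One possible layout, sketched here only to make the question concrete, is to give each occurrence a row of its own and index the word column, so that a word with 50,000 occurrences becomes 50,000 short rows rather than one huge field. The table name, column names, and the use of SQLite are assumptions made for illustration:

```python
import sqlite3


def build_index(occurrences):
    """Create an in-memory table with one row per (word, file, byte) occurrence."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE occurrence ("
        " word TEXT NOT NULL,"
        " file_no INTEGER NOT NULL,"
        " byte_pos INTEGER NOT NULL)"
    )
    conn.executemany("INSERT INTO occurrence VALUES (?, ?, ?)", occurrences)
    # The index on `word` lets a lookup avoid scanning the whole table.
    conn.execute("CREATE INDEX idx_occurrence_word ON occurrence (word)")
    conn.commit()
    return conn


def lookup(conn, word, limit=None):
    """Return (file_no, byte_pos) pairs for `word`, optionally only the first `limit`."""
    sql = (
        "SELECT file_no, byte_pos FROM occurrence"
        " WHERE word = ? ORDER BY file_no, byte_pos"
    )
    params = [word]
    if limit is not None:
        sql += " LIMIT ?"
        params.append(limit)
    return conn.execute(sql, params).fetchall()
```

With the optional limit, a concordance page for a very frequent word could fetch only as many pointers as it is going to display.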

All comments are welcome.

Thanks,

Mickel Grönroos
Helsinki

www.ling.helsinki.fi/~mcgronro/  | Mickel.Gronroos at helsinki.fi
---------------------------------|----------------------------
Inst. för allmän språkvetenskap  | Dep. of General Linguistics
PB 4 (Fabiansgatan 28)           |  tfn/phone +358-9-191 22707
FI-00014 Helsingfors universitet |        fax +358-9-191 23598