Corpora: Using a relational database to store conc pointers

Thu Mar 30 08:05:37 UTC 2000

Dear Mickel,

If your files are of a reasonable length, have you considered storing
pointers to files only, and resolving the positions inside the files
automatically on look-up?

Of course, the overhead may be too big if tokens have to be identified
on the fly, but I am using this approach with a tokenised corpus, and
speed is o.k.
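
To make the idea concrete, here is a minimal sketch. It is not the actual
set-up described above; it assumes a whitespace-tokenised corpus, uses SQLite
as a stand-in for the relational database, and the table name `postings` and
the functions `build_index`, `lookup` and `demo` are hypothetical names
chosen for illustration. The database holds one row per word type and file,
and byte positions are found by scanning the file at look-up time.

```python
import os
import re
import sqlite3
import tempfile

TOKEN = re.compile(rb"\S+")  # assumption: tokens are separated by whitespace


def build_index(conn, corpus_files):
    """Store one (word, file_id) row per word type and file; no byte positions."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS postings ("
        "word TEXT NOT NULL, file_id INTEGER NOT NULL, "
        "PRIMARY KEY (word, file_id))"
    )
    for file_id, path in corpus_files.items():
        with open(path, "rb") as f:
            data = f.read()
        types = {m.group().decode("utf-8") for m in TOKEN.finditer(data)}
        conn.executemany(
            "INSERT OR IGNORE INTO postings (word, file_id) VALUES (?, ?)",
            [(word, file_id) for word in types],
        )
    conn.commit()


def lookup(conn, corpus_files, word):
    """Return (file_id, byte_offset) pairs, resolving offsets on look-up."""
    target = word.encode("utf-8")
    rows = conn.execute(
        "SELECT file_id FROM postings WHERE word = ? ORDER BY file_id", (word,)
    ).fetchall()
    hits = []
    for (file_id,) in rows:
        with open(corpus_files[file_id], "rb") as f:
            data = f.read()
        # Scan the tokenised file; m.start() is the byte offset of the token.
        hits.extend(
            (file_id, m.start()) for m in TOKEN.finditer(data) if m.group() == target
        )
    return hits


def demo():
    """Build a two-file toy corpus and look up two words."""
    texts = {343: "kissa istuu ja koira istuu", 344: "koira juoksee"}
    with tempfile.TemporaryDirectory() as tmp:
        corpus_files = {}
        for file_id, text in texts.items():
            path = os.path.join(tmp, f"{file_id}.txt")
            with open(path, "wb") as f:
                f.write(text.encode("utf-8"))
            corpus_files[file_id] = path
        conn = sqlite3.connect(":memory:")
        try:
            build_index(conn, corpus_files)
            return lookup(conn, corpus_files, "istuu"), lookup(conn, corpus_files, "koira")
        finally:
            conn.close()
```

With this layout a word that occurs 50,000 times needs at most one row per
file it appears in, at the price of reading each of those files again on
every query.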

Hope that helps,
Tylman

Mickel Grönroos wrote:
>
> Does anybody have any experience of using a relational database to store
> index information for a concordance service?
>
> I'm building a test interface for the Bank of Finnish and plan to store
> pointers to specific locations in the corpus in a database column, e.g.
> something like 344:2555 would point to corpus file number 344, byte
> position 2555.
>
> The obvious problem is how one should handle common words, as every
> occurrence of a specific type needs a pointer of its own. So, if the
> frequency of some common word is, say, 50,000, this would generate 50,000
> pointers as well. Putting these in one field in a column seems to be
> rather foolish. Does anybody know how to avoid this?
>

--
Tylman Ule,  Tel. 07071/29-78490, Fax 07071/550520
	Seminar für Sprachwissenschaft, Universität Tübingen
        Kleine Wilhelmstraße 113, 72074 Tübingen