[Corpora-List] Web service for Web1T?

lec3jrw at leeds.ac.uk
Fri Oct 26 08:49:35 UTC 2007


Trevor,

>> Back-of-an-envelope calculations suggest to me that, with judicious use of
>> hashes, you ought to be able to squash the entire database down to less
>> than 30GB; ...

> Assuming the 14GB is base data, it is possible to get the requirement lower
> than that. There is at least one text retrieval system that regularly
> squashed the indices down to less than unity; indeed, on large document
> collections I have seen the indexing overhead down to 0.4. It uses hashes
> for the term-lists. I thought I should declare a quasi-commercial interest,
> in that I worked for the producers (including in the R&D group) for nigh
> on 10 years.

I think the base data is far larger than 14GB (see Iain's follow-up reply); the
14GB figure is for a heavily compressed form, without any kind of indexing.
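
For what it's worth, the arithmetic behind my envelope (I quote the published
Web1T counts from memory, so treat them as approximate): roughly 13.6M unigrams,
315M bigrams, 977M trigrams, 1.31B 4-grams and 1.18B 5-grams, or about 3.8
billion n-grams in all. At one 8-byte hash fingerprint per n-gram that comes to
3.8e9 * 8 bytes, i.e. roughly 30GB, before the counts themselves are stored.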

> As it happens, the text retrieval system with hashes replaced an earlier
> one that used trees (a B-tree variant). Not only did the storage
> requirement go down with hashes, so did the processing time. Data
> Structures 101 says that a hash lookup is O(1) whilst a B-tree lookup is
> O(log n); I'd always go for O(1) myself.
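
To make the contrast concrete, here is a minimal Python sketch (purely
illustrative - the terms and posting offsets are invented), with the tree side
emulated by a sorted array and binary search:

import bisect

# Hash-based term list: a probe is O(1) on average.
hash_index = {"corpus": 5310, "service": 77, "web": 1042}  # term -> posting offset

def lookup_hash(term):
    return hash_index.get(term)

# Tree-style term list, emulated with a sorted array + binary search:
# each probe is O(log n).
sorted_terms = ["corpus", "service", "web"]
offsets = [5310, 77, 1042]

def lookup_tree(term):
    i = bisect.bisect_left(sorted_terms, term)
    if i < len(sorted_terms) and sorted_terms[i] == term:
        return offsets[i]
    return None

print(lookup_hash("web"), lookup_tree("web"))  # -> 1042 1042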

I meant to imply a combination of hash and tree, in which nodes are referenced
by hashes for very fast traversal, as a compromise that keeps the advantages of
both - though I haven't really thought it through.
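
Something like the following, as a rough Python sketch (the tokens and the
count are invented): a trie over n-gram tokens whose child links all live in a
single hash table, so each step of a traversal is one O(1) probe, while the
tree shape still shares prefixes and supports enumerating completions:

class HashTrie:
    def __init__(self):
        self.children = {}  # (node_id, token) -> child node_id
        self.counts = {}    # node_id -> n-gram count
        self.next_id = 1    # node 0 is the root

    def insert(self, tokens, count):
        node = 0
        for tok in tokens:
            key = (node, tok)
            if key not in self.children:
                self.children[key] = self.next_id
                self.next_id += 1
            node = self.children[key]
        self.counts[node] = count

    def lookup(self, tokens):
        node = 0
        for tok in tokens:
            node = self.children.get((node, tok))
            if node is None:
                return None
        return self.counts.get(node)

trie = HashTrie()
trie.insert(["new", "york", "times"], 210383)  # made-up count
print(trie.lookup(["new", "york", "times"]))   # -> 210383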

Thank you for the references - I shall look into these!

Justin Washtell


_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora


