[Corpora-List] Web service for Web1T?

Trevor Jenkins trevor.jenkins at suneidesis.com
Fri Oct 26 09:42:33 UTC 2007


On Fri, 26 Oct 2007, lec3jrw at leeds.ac.uk wrote:

> >> Back-of-an-envelope calculations suggest to me that, with judicious use of
> >> hashes, you ought to be able to squash the entire database down to less
> >> than 30GB; ...
>
> > Assuming the 14GB is base data it is possible to get the requirement lower
> > than that. ...
>
> I think the base data is far larger than 14GB (see Iain's following
> reply); 14GB is in a heavily compressed form, without any kind of
> indexing.

Not being a fan of top-posting (*) I missed the bit with the actual size.
However, the text retrieval system I mentioned can deal with that volume of
data; our US office was involved in an "if I told you I'd have to kill
you" sale that required multi-terabytes, which back in the early 1990s was
pushing the bounds of hardware and software.

> > As it happens the text retrieval system with hashes replaced an earlier
> > one that used trees (a B-tree variant). Not only did the storage
> > requirement go down with hashes, so did the processing time. Data
> > structures 101 says that a hash lookup is O(1) whilst a B-tree lookup is
> > O(log n); I'd always go for O(1) myself.
>
> I meant to imply a combination of hash and tree, in which nodes are referenced
> by hashes for very quick traversal, as a compromise between the advantages of
> both - but I haven't really thought it through.

Oh, you meant for searching through the overflow buckets. Careful selection
of the hash function reduces the possibility of collisions. A lot of
design effort went into creating a hash function that produced few
collisions even when applied to multilingual texts.
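
For anyone who hasn't met the idea, here is a toy sketch in Python of a hash
table with overflow buckets (entirely illustrative and of my own devising,
not the scheme our product used): each slot holds a small list of entries,
so a collision simply means a short scan of one list rather than a re-hash.

# Toy separate-chaining hash table: each slot holds an "overflow bucket"
# (a small list) for keys whose hashes collide. Lookup is O(1) on average
# provided the hash function spreads keys evenly.
class BucketHash:
    def __init__(self, slots=1 << 16):
        self.slots = slots
        self.buckets = [[] for _ in range(slots)]

    def _slot(self, key):
        # Python's built-in hash() stands in for a carefully chosen
        # hash function that behaves well on multilingual text.
        return hash(key) % self.slots

    def put(self, key, value):
        bucket = self.buckets[self._slot(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)   # overwrite an existing entry
                return
        bucket.append((key, value))        # collision: append to the bucket

    def get(self, key, default=None):
        for k, v in self.buckets[self._slot(key)]:
            if k == key:
                return v
        return default

# Example: storing an n-gram count keyed by the n-gram string.
table = BucketHash()
table.put("serve as the", 12345)
print(table.get("serve as the"))   # -> 12345

With a well-chosen hash function the buckets stay short and lookups remain
effectively constant time; with a poor one everything piles into a few
buckets and you are back to scanning.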

> Thank you for the references - I shall look into these!

Witten, I. H., A. Moffat, and T. C. Bell (1999) "Managing Gigabytes:
Compressing and Indexing Documents and Images" (second edition). San
Francisco: Morgan Kaufmann.

It should be read with Knuth's Art of Computer Programming in the other
hand.

(*) As a PWD (person with dyslexia) normal reading order works best for me.
I love these .sigs seen on the 'net:

A: Yes.
Q: Are you sure?
A: Because it reverses the logical flow of the argument.
Q: Why is top-posting so frowned on?

A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?
A: Top-posting.
Q: What is the most annoying thing on usenet?

Regards, Trevor

<>< Re: deemed!



_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora


