[Corpora-List] Web service for Web1T?

Trevor Jenkins trevor.jenkins at suneidesis.com
Thu Oct 25 21:51:56 UTC 2007


On Thu, 25 Oct 2007, lec3jrw at leeds.ac.uk wrote:

I only joined the list recently, so this might have been discussed
previously.

> Back-of-an-envelope calculations suggest to me that, with judicious use of
> hashes, you ought to be able to squash the entire database down to less than
> 30GB; ...

Assuming the 14GB is base data, it is possible to get the requirement lower
than that. There is at least one text retrieval system that regularly
squashed the indices down to less than unity; indeed, on large document
collections I have seen the indexing overhead down to 0.4 (i.e. roughly
5.6GB of index for 14GB of base data). It uses hashes for the term lists.
I should declare a quasi-commercial interest, in that I worked for the
producers (including in the R&D group) for nigh on 10 years.

> ...but you'd only realistically be able to query it on the frequency of a
> given n-gram... ...

The system has full collocation pointers (document, field, paragraph,
sentence, word-within-sentence) within that 0.4 overhead, plus left- and
right-truncation of terms.
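
To make that concrete, here is a toy Python sketch of the idea (my own
illustration, not the product's internals): a hash-keyed term list whose
postings carry the same five-level pointers. All the names are mine.

    from collections import defaultdict

    class HashedTermIndex:
        # The dict is a hash table, so term lookup is O(1) expected.
        def __init__(self):
            self.postings = defaultdict(list)

        def add(self, term, doc, field, para, sent, word):
            # One posting = (document, field, paragraph, sentence,
            # word-within-sentence).
            self.postings[term].append((doc, field, para, sent, word))

        def lookup(self, term):
            return self.postings.get(term, [])

    idx = HashedTermIndex()
    for pos, term in enumerate(["the", "cat", "sat"]):
        idx.add(term, doc=1, field="body", para=1, sent=1, word=pos)
    print(idx.lookup("cat"))   # [(1, 'body', 1, 1, 1)]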

> ... which was the original task suggested I think. If you wanted to ask
> "which n-grams have frequency x", or "how many n-grams contain the word
> y at position z" the indices get bigger than that very quickly.

Not necessarily. If you take a look at Witten, Moffat, and Bell's
"Managing Gigabytes", the trick is to compress the position pointers in
the term lists. It is then easy to get overheads of less than unity.
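
Roughly, the trick is to store the gap between successive positions rather
than the absolute positions, and then compress the (mostly small) gaps with
a variable-length code. A quick Python illustration using variable-byte
coding (the book also covers gamma and delta codes; this is just one of
them):

    def vbyte_encode(gaps):
        # Variable-byte code: 7 data bits per byte, high bit set on
        # every byte except the last byte of each number.
        out = bytearray()
        for g in gaps:
            chunks = []
            while True:
                chunks.append(g & 0x7F)
                g >>= 7
                if g == 0:
                    break
            for b in reversed(chunks[1:]):
                out.append(b | 0x80)
            out.append(chunks[0])
        return bytes(out)

    def to_gaps(positions):
        # [3, 7, 8, 20, 150] -> [3, 4, 1, 12, 130]
        return [b - a for a, b in zip([0] + positions, positions)]

    positions = [3, 7, 8, 20, 150]
    encoded = vbyte_encode(to_gaps(positions))
    print(len(encoded), "bytes for", len(positions), "positions")  # 6 bytes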

> ... But again, judicious use of hashes might go a long way. There might
> be some mileage to be had by using a tree structure whereby each 4-gram,
> for example, is stored as a reference to two 2-grams; that might achieve
> some compromise on index size versus query time.

As it happens, the text retrieval system with hashes replaced an earlier
one that used trees (a B-tree variant). Not only did the storage
requirement go down with hashes, so did the processing time. Data
Structures 101 says that a hash lookup is O(1) whilst a B-tree lookup is
O(log n); I'd always go for O(1) myself.
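
You can see the difference with a toy comparison in Python (the counts
below are made up, and a sorted list searched by bisection stands in for
the tree): a dict gives O(1) expected lookup of an n-gram's frequency,
while binary search over sorted keys gives the O(log n) of a tree.

    import bisect

    # Illustrative counts only, not real Web1T figures.
    freq_hash = {("New", "York"): 420000, ("of", "the"): 9800000}

    # Hash table: O(1) expected lookup.
    print(freq_hash.get(("New", "York"), 0))

    # Tree-style alternative, approximated by binary search over
    # sorted keys: O(log n) per lookup, and inserts must keep the
    # keys sorted.
    keys = sorted(freq_hash)
    vals = [freq_hash[k] for k in keys]

    def lookup_sorted(ngram):
        i = bisect.bisect_left(keys, ngram)
        if i < len(keys) and keys[i] == ngram:
            return vals[i]
        return 0

    print(lookup_sorted(("New", "York")))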

> As an aside - it also seems to me that throwing away those n-grams with fewer
> than 40 occurrences might not be as useful, for many tasks, as throwing away
> those whose frequencies are within a given threshold of what would be expected
> by chance, i.e. given the known frequencies of their component (<n)-grams.

The system in question threw nothing away; it didn't even have a stop-word
list. (A deliberate decision, by the way.)

Regards, Trevor

<>< Re: deemed!


