[Corpora-List] Google 1T set (results)

Miles Osborne miles at inf.ed.ac.uk
Tue Nov 20 14:22:47 UTC 2007


A while back I mentioned that we would look at representing the Google ngram
set using a Bloom Filter.

Here are some sample results.  We used 2 Gb of space (three hash functions,
false-positive rate of about 10%) and discarded the count information.
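
As a rough sanity check on those numbers, here is a small Python sketch of
the standard textbook approximation for a Bloom filter's false-positive
rate, p = (1 - exp(-k*n/m))^k, with m bits, n stored items and k hash
functions. The function names are made up for illustration, and the formula
assumes independent, uniform hash functions; it is not taken from the
experiments themselves.

```python
import math


def false_positive_rate(m_bits: int, n_items: int, k: int) -> float:
    """Approximate false-positive rate: p = (1 - exp(-k*n/m))**k."""
    return (1.0 - math.exp(-k * n_items / m_bits)) ** k


def bits_per_item(p: float, k: int) -> float:
    """Invert the approximation to get m/n for a target rate p and fixed k."""
    return -k / math.log(1.0 - p ** (1.0 / k))


if __name__ == "__main__":
    # Three hash functions and a target rate of about 10%, as in the post.
    print(f"bits per stored item: {bits_per_item(0.10, 3):.2f}")
```

Under these assumptions, k = 3 and p = 0.10 work out to a little under
5 bits per stored ngram.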

The filter never misses an ngram that was actually stored (there are no
false negatives). For queries that were not stored in the table, this is
what it returned (0 = reported absent, 1 = reported present):

>
serve as there insurer  0
sarkozy sarkozy sarkozy 0
ZZZZX zxzxzx rareta     0
mein name ish trudyyyy  0
bvcxc can't sphelle     0
truant officers can dance       0
ceramics community fore 0
ceramics community four 0
99999999999999999999999999      0
RUN! martians are here! 1  (*****error)
cermaics composed       0
duo core quad core pentium      0
serve the instructional institution     0
the vodka is strong     0
>

So error rates at this level may be acceptable for some applications.
What amazes me, though, is that a query can be answered with just three
hash function evaluations.
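
To make the "three hash functions per query" point concrete, here is a
minimal Python sketch of a Bloom filter that stores membership only (no
counts). It is an illustration, not the code used in the experiments: the
class name, the choice of salted BLAKE2b as the hash family, and the tiny
filter size are all assumptions.

```python
import hashlib


class BloomFilter:
    """Membership-only Bloom filter: a bit array plus k hash functions."""

    def __init__(self, num_bits: int, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)  # all bits start at 0

    def _positions(self, item: str):
        # Derive k bit positions by salting one hash with the index 0..k-1.
        data = item.encode("utf-8")
        for i in range(self.num_hashes):
            salt = i.to_bytes(8, "little")
            digest = hashlib.blake2b(data, digest_size=8, salt=salt).digest()
            yield int.from_bytes(digest, "little") % self.num_bits

    def add(self, item: str) -> None:
        # Storing an item sets its k bits.
        for pos in self._positions(item):
            self.bits[pos >> 3] |= 1 << (pos & 7)

    def __contains__(self, item: str) -> bool:
        # A query checks the same k bits; any 0 bit means "definitely absent".
        return all(
            self.bits[pos >> 3] & (1 << (pos & 7))
            for pos in self._positions(item)
        )


if __name__ == "__main__":
    bf = BloomFilter(num_bits=10_000)
    bf.add("serve as the")  # hypothetical stored ngram
    print("serve as the" in bf)        # stored items are always found
    print("sarkozy sarkozy sarkozy" in bf)  # usually False, occasionally a false positive
```

A stored ngram always tests positive because its bits were set when it was
added. An ngram that was never stored tests positive only if all three of
its bits happen to have been set by other entries, which is where the
errors come from.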

Thanks to Abby Levenberg for doing the experiments.

Miles