[Corpora-List] Google 1T set (results)
Miles Osborne
miles at inf.ed.ac.uk
Tue Nov 20 14:22:47 UTC 2007
A while back I mentioned that we would look at representing the Google ngram
set using a Bloom Filter.
Here are some sample results. We used 2 Gb of space (three hash functions,
false-positive rate of about 10%) and threw away count information.
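
For anyone who has not met the data structure, here is a minimal sketch of a count-free Bloom filter with three hash functions. It is not the code used in the experiments: the class name, the salted BLAKE2b hashing, the tiny bit array and the stored example ngrams are all assumptions made for illustration.

```python
import hashlib


class BloomFilter:
    """Set-membership filter: no false negatives, some false positives."""

    def __init__(self, num_bits, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)  # all bits start at 0

    def _positions(self, ngram):
        # One bit position per hash function. Each "function" is BLAKE2b
        # with a different one-byte salt (an illustrative choice).
        data = ngram.encode("utf-8")
        for i in range(self.num_hashes):
            digest = hashlib.blake2b(data, digest_size=8, salt=bytes([i])).digest()
            yield int.from_bytes(digest, "big") % self.num_bits

    def add(self, ngram):
        # Storing an ngram sets its bits; no count is kept.
        for pos in self._positions(ngram):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, ngram):
        # A query only checks whether all of the ngram's bits are set.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(ngram)
        )


bf = BloomFilter(num_bits=1 << 20)
for ngram in ["serve as their insurer", "the vodka is good"]:  # hypothetical entries
    bf.add(ngram)

print("serve as their insurer" in bf)  # stored ngrams are always reported present
print("serve as there insurer" in bf)  # unstored: usually absent, occasionally a false positive
```

A query never touches the ngrams themselves, only the bit array, which is why the whole set fits in a fixed amount of space.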
The filter itself always answers correctly for an ngram you previously stored
(there are no false negatives). But for entries that were not stored in the
table (0 = reported absent, 1 = reported present):
serve as there insurer 0
sarkozy sarkozy sarkozy 0
ZZZZX zxzxzx rareta 0
mein name ish trudyyyy 0
bvcxc can't sphelle 0
truant officers can dance 0
ceramics community fore 0
ceramics community four 0
99999999999999999999999999 0
RUN! martians are here! 1 (*****error)
cermaics composed 0
duo core quad core pentium 0
serve the instructional institution 0
the vodka is strong 0
So, error rates at these levels may be acceptable for some applications.
But the thing which amazes me is the ability to answer a query with just three
hash function evaluations.
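
On the error rate: the usual approximation for a Bloom filter with m bits, n stored items and k hash functions is p = (1 - exp(-k*n/m))^k. A rough sketch of that arithmetic follows. The function names are made up for illustration, and reading "2 Gb" as 2 * 2^30 bytes is an assumption, so the resulting item count is only indicative.

```python
import math


def false_positive_rate(num_bits, num_items, num_hashes):
    """Standard approximation p = (1 - exp(-k*n/m))**k."""
    return (1.0 - math.exp(-num_hashes * num_items / num_bits)) ** num_hashes


def capacity_for_rate(num_bits, target_rate, num_hashes):
    """Invert the approximation: items that fit at a given error rate."""
    return -num_bits / num_hashes * math.log(1.0 - target_rate ** (1.0 / num_hashes))


m = 2 * 8 * 2**30  # bits, ASSUMING "2 Gb" means 2 * 2^30 bytes
n = capacity_for_rate(m, target_rate=0.10, num_hashes=3)

print(f"about {n:.3g} items at a 10% error rate with 3 hashes")
print(f"check: p = {false_positive_rate(m, n, 3):.3f}")
```

The same functions show the trade-off directly: halving the number of stored items or doubling the space lowers p without changing the three lookups per query.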
Thanks to Abby Levenberg for doing the experiments.
Miles