A while back I mentioned that we would look at representing the Google ngram set using a Bloom Filter.<br><br>Here are some sample results. We used 2 Gb of space (three hash functions, error rate of about 10%) and threw away count information.
<br><br>The filter itself can always tell you when you previously stored an ngram. But for entries that were not stored in the table:<br><br>><br>
serve as there insurer 0<br>sarkozy sarkozy sarkozy 0
<br>ZZZZX zxzxzx rareta 0<div class="Ih2E3d">mein name ish trudyyyy 0<br>bvcxc can't sphelle 0<br>truant officers can dance 0<br>ceramics community fore 0<br></div>ceramics community four 0<br>99999999999999999999999999 0
<br>RUN! martians are here! 1 (*****error)<br><div class="Ih2E3d">cermaics composed 0<br>duo core quad core pentium 0<br>serve the instructional institution 0<br></div>the vodka is strong 0<br>><br>
<br>So, error rates at these levels may be acceptable for some applications. But the thing which amazes me is the ability to answer a query in just three hash functions.<br><br>Thanks to Abby Levenberg for doing the experiments.
<br><br>Miles<br>