As a fun exercise, we are going to encode all of the Google release in a Bloom filter and see how that goes. We were about to publish a web front-end for it, but given the licensing, that doesn't look like a viable option.
<br><br>For the interested, we had a pair of papers on this kind of thing at ACL and EMNLP this year:<br><br><b>David Talbot; Miles Osborne</b><br><i>Randomised Language Modelling for Statistical Machine Translation</i><br><br>
<a href="http://acl.ldc.upenn.edu/P/P07/P07-1065.pdf">http://acl.ldc.upenn.edu/P/P07/P07-1065.pdf</a><br><br><b>David Talbot; Miles Osborne</b><br><i>Smoothed Bloom Filter Language Models: Tera-Scale LMs on the Cheap</i><br>
<br><a href="http://acl.ldc.upenn.edu/D/D07/D07-1049.pdf">http://acl.ldc.upenn.edu/D/D07/D07-1049.pdf</a><br><br>Miles<br>