[Corpora-List] Google releases their database of N-grams

John F. Sowa sowa at bestweb.net
Fri Aug 4 21:50:41 UTC 2006


Google, one of the world's biggest data collectors, is releasing
their collection of 5-grams as freely available data.
Anyone who is interested in doing research on techniques that
use N-grams can now wallow in an ocean of data.

Following is an excerpt from the Google announcement.

John Sowa
__________________________________________________________________

http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html

Google Research

All Our N-gram are Belong to You

8/03/2006 11:26:00 AM
Posted by Alex Franz and Thorsten Brants,
Google Machine Translation Team

Here at Google Research we have been using word n-gram models for a 
variety of R&D projects, such as statistical machine translation, speech 
recognition, spelling correction, entity detection, information 
extraction, and others. While such models have usually been estimated 
from training corpora containing at most a few billion words, we have 
been harnessing the vast power of Google's datacenters and distributed 
processing infrastructure to process larger and larger training corpora. 
We found that there's no data like more data, and scaled up the size of 
our data by one order of magnitude, and then another, and then one more 
- resulting in a training corpus of one trillion words from public Web 
pages.

We believe that the entire research community can benefit from access to 
such massive amounts of data. It will advance the state of the art, it 
will focus research in the promising direction of large-scale, 
data-driven approaches, and it will allow all research groups, no matter 
how large or small their computing resources, to play together. That's 
why we decided to share this enormous dataset with everyone. We 
processed 1,011,582,453,213 words of running text and are publishing the 
counts for all 1,146,580,664 five-word sequences that appear at least 40 
times. There are 13,653,070 unique words, after discarding words that 
appear less than 200 times.

Watch for an announcement at the LDC, who will be distributing it soon, 
and then order your set of 6 DVDs. And let us hear from you - we're 
excited to hear what you will do with the data, and we're always 
interested in feedback about this dataset, or other potential datasets 
that might be useful for the research community.
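
The following is not part of the Google announcement. It is a minimal
sketch of how counts with frequency cutoffs like the ones quoted above
(five-word sequences kept at 40 or more occurrences, words kept at 200
or more) could be computed on a small in-memory corpus. The function
names, the "<UNK>" placeholder, and the choice to map rare words to a
placeholder instead of deleting them are assumptions made for
illustration; the announcement says only that rare words were
discarded, and a trillion-word corpus would need distributed
processing that this sketch does not attempt.

```python
from collections import Counter


def count_ngrams(tokens, n):
    """Count every run of n consecutive tokens in the token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def build_tables(tokens, n=5, min_word_count=200, min_ngram_count=40,
                 unk="<UNK>"):
    """Return (vocabulary, n-gram counts) after applying both cutoffs.

    The defaults mirror the thresholds quoted in the announcement.
    """
    word_counts = Counter(tokens)
    # Keep only words seen at least min_word_count times.
    vocab = {w for w, c in word_counts.items() if c >= min_word_count}
    # Assumption: rare words are replaced by a placeholder, not removed,
    # so that the remaining words keep their original neighbours.
    mapped = [w if w in vocab else unk for w in tokens]
    # Keep only n-grams seen at least min_ngram_count times.
    ngrams = {g: c for g, c in count_ngrams(mapped, n).items()
              if c >= min_ngram_count}
    return vocab, ngrams


if __name__ == "__main__":
    # Toy corpus with toy thresholds so the effect is visible.
    toy = "a b a b a b c".split()
    vocab, bigrams = build_tables(toy, n=2, min_word_count=2,
                                  min_ngram_count=2)
    print(sorted(vocab))
    print(bigrams)
```

In the toy run, "c" occurs only once and falls below the word cutoff,
so the bigram containing it is counted under the placeholder and then
dropped by the n-gram cutoff.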


