[Corpora-List] Google releases their database of N-grams
John F. Sowa
sowa at bestweb.net
Fri Aug 4 21:50:41 UTC 2006
Google, one of the world's biggest data collectors, is
releasing its collection of 5-grams as freely available data.
Anyone who is interested in doing research on techniques that
use N-grams can now wallow in an ocean of data.
Following is an excerpt from the Google announcement.
John Sowa
__________________________________________________________________
http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html
Google Research
All Our N-gram are Belong to You
8/03/2006 11:26:00 AM
Posted by Alex Franz and Thorsten Brants,
Google Machine Translation Team
Here at Google Research we have been using word n-gram models for a
variety of R&D projects, such as statistical machine translation, speech
recognition, spelling correction, entity detection, information
extraction, and others. While such models have usually been estimated
from training corpora containing at most a few billion words, we have
been harnessing the vast power of Google's datacenters and distributed
processing infrastructure to process larger and larger training corpora.
We found that there's no data like more data, and scaled up the size of
our data by one order of magnitude, and then another, and then one more
- resulting in a training corpus of one trillion words from public Web
pages.
We believe that the entire research community can benefit from access to
such massive amounts of data. It will advance the state of the art, it
will focus research in the promising direction of large-scale,
data-driven approaches, and it will allow all research groups, no matter
how large or small their computing resources, to play together. That's
why we decided to share this enormous dataset with everyone. We
processed 1,011,582,453,213 words of running text and are publishing the
counts for all 1,146,580,664 five-word sequences that appear at least 40
times. There are 13,653,070 unique words, after discarding words that
appear fewer than 200 times.
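
To make the two cutoffs concrete, here is a minimal sketch in Python of counting n-grams and dropping the rare ones. It is not Google's pipeline, which ran on distributed infrastructure; the function names, the in-memory Counter, and the pre-tokenized input are all assumptions made for illustration. The announcement also does not say how a discarded word is treated when it occurs inside a five-word sequence, so this sketch applies the two cutoffs independently.

```python
from collections import Counter
from typing import Iterable, Iterator


def ngrams(tokens: list[str], n: int) -> Iterator[tuple[str, ...]]:
    """Yield every run of n consecutive tokens (hypothetical helper)."""
    for i in range(len(tokens) - n + 1):
        yield tuple(tokens[i:i + n])


def count_with_cutoffs(
    sentences: Iterable[list[str]],
    n: int = 5,
    min_ngram_count: int = 40,   # cutoff quoted in the announcement
    min_word_count: int = 200,   # cutoff quoted in the announcement
) -> tuple[Counter, Counter]:
    """Return (vocabulary counts, n-gram counts) after applying both cutoffs.

    Assumption: the two cutoffs are applied independently of each other.
    """
    word_counts: Counter = Counter()
    ngram_counts: Counter = Counter()
    for tokens in sentences:
        word_counts.update(tokens)
        ngram_counts.update(ngrams(tokens, n))
    # Keep only the entries that reach their respective thresholds.
    vocab = Counter({w: c for w, c in word_counts.items() if c >= min_word_count})
    kept = Counter({g: c for g, c in ngram_counts.items() if c >= min_ngram_count})
    return vocab, kept


# Toy run with tiny thresholds so the effect is visible on a few sentences.
toy = [["the", "cat", "sat", "on", "the", "mat"]] * 3 + [["a", "dog", "ran"]]
vocab, bigrams = count_with_cutoffs(toy, n=2, min_ngram_count=3, min_word_count=3)
print(sorted(vocab))    # words seen at least 3 times
print(len(bigrams))     # bigrams seen at least 3 times
```

At the scale described in the announcement the counts would not fit in one process, so a real run would shard the counting across machines and merge partial counts before applying the cutoffs.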
Watch for an announcement at the LDC, who will be distributing it soon,
and then order your set of 6 DVDs. And let us hear from you - we're
excited to hear what you will do with the data, and we're always
interested in feedback about this dataset, or other potential datasets
that might be useful for the research community.