[Corpora-List] word frequencies on the web

radev at umich.edu
Fri Dec 8 16:51:25 UTC 2006


Have you seen this release from Google?

http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T13


Introduction

This data set, contributed by Google Inc., contains English word
n-grams and their observed frequency counts. The length of the n-grams
ranges from unigrams (single words) to five-grams. We expect this data
will be useful for statistical language modeling, e.g., for machine
translation or speech recognition, as well as for other uses.

Source Data

The n-gram counts were generated from approximately 1 trillion word
tokens of text from publicly accessible Web pages.
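
For the word-frequency question, a minimal sketch of how unigram counts from such a release could be turned into relative frequencies. It assumes one n-gram per line followed by a tab and its count; that layout, the helper names load_counts and relative_frequency, and the sample counts are illustrative assumptions, not details taken from the announcement.

```python
import io
from collections import Counter


def load_counts(lines):
    """Parse 'ngram<TAB>count' lines into a Counter keyed by the n-gram string.

    The tab-separated layout is an assumption; adjust the split if the
    actual files differ.
    """
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Split on the last tab so spaces inside longer n-grams are preserved.
        ngram, _, count = line.rpartition("\t")
        counts[ngram] += int(count)
    return counts


def relative_frequency(counts, word):
    """Return count(word) divided by the total of all loaded counts."""
    total = sum(counts.values())
    return counts[word] / total if total else 0.0


# Made-up sample data standing in for a unigram count file.
sample = io.StringIO("the\t50\ncorpus\t3\nfrequency\t7\n")
unigrams = load_counts(sample)
print(relative_frequency(unigrams, "corpus"))
```

With real data, an open file handle can be passed in place of the in-memory sample, since load_counts only needs an iterable of lines.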



> 
> Dear all, does anyone know of ways to estimate the frequency of words
> on the web, or whether there are search engines that supply this info
> (as Altavista used to do)?
> 
> thank you!
> tony
> www2.lael.pucsp.br/~tony


-- 
Dragomir R. Radev                    Associate Professor
SI, CSE, Ling                     U. Michigan, Ann Arbor 
http://www.eecs.umich.edu/~radev         radev at umich.edu              


