[Corpora-List] UPDATE: Corrected Word frequencies for a large corpus of recent USENET text, and full list of types.

Cyrus Shaoul cyrus.shaoul at ualberta.ca
Sat Sep 2 07:52:51 UTC 2006


Hello again,

**
IMPORTANT: IF YOU DOWNLOADED THE ORIGINAL LIST, PLEASE GET THE CORRECTED 
VERSION. SEE THE NOTE BELOW.
**

  A "thank you" to all the folk who downloaded the first version of our 
USENET word list. Some people asked for a larger list of types, 
not restricted to my original dictionary. I have now finished the list 
of all types with a frequency greater than 3 tokens per million tokens. 
It is large (28 MB, compressed), with 5,609,086 types. Unfortunately, 
most of the types in this list are URLs, e-mail addresses and other 
cruft that are artifacts of my overly simplistic text processing 
(delete punctuation and split on whitespace).
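
To illustrate why that processing produces so much cruft, here is a 
minimal Python sketch of a "delete punctuation and split on whitespace" 
tokenizer. It is not the program used to build the list: the function 
names, the use of ASCII punctuation only, and the absence of case 
folding are all assumptions made for the example.

```python
import string
from collections import Counter

# Assumption: "punctuation" means the ASCII punctuation characters only.
_DELETE_PUNCT = str.maketrans("", "", string.punctuation)


def naive_tokens(text):
    """Delete punctuation, then split on whitespace (hypothetical helper)."""
    return text.translate(_DELETE_PUNCT).split()


def count_types(lines):
    """Count how many tokens of each type occur in an iterable of lines."""
    counts = Counter()
    for line in lines:
        counts.update(naive_tokens(line))
    return counts


# URLs and e-mail addresses collapse into single junk types.
sample = ["See http://example.com/page, or mail me: user@example.com!"]
counts = count_types(sample)
print(sorted(counts))
```

In this sketch the URL and the address each survive as one long 
run-together type, which is the kind of artifact described above.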

I know this list is not for everyone, but if you are interested in 
seeing a lot of types, please download the file from here, and please 
send me any feedback you have. I sorted the list by decreasing type 
frequency.
 
http://www.psych.ualberta.ca/~westburylab/downloads/wlallfreq.download.html

WARNING: File size is 28 MB, compressed

**
NOTE: In doing this run, I noticed that my corpus grew in size from 5.9 
to 7.8 billion words, despite the fact that I was using the same raw 
data. I then discovered my bug: my original program forgot to count 
non-words when computing the corpus size. So if you downloaded the 
original list of 111,627 words, the corpus size and freq/million numbers 
are WRONG! The raw counts were correct, though. Please download the 
corrected list here (914 KB, compressed):

http://www.psych.ualberta.ca/~westburylab/downloads/wlfreq.download.html

I also sorted this list by decreasing frequency for ease of use.
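
For anyone who would rather rescale the old numbers than download 
again: since the raw counts were correct, a per-million frequency can 
be recomputed from a count and the corrected corpus size. The Python 
sketch below is only an illustration. The function names are made up, 
the two corpus sizes are the rounded figures quoted above (5.9 and 7.8 
billion), and the example count of 59,000 is invented.

```python
def per_million(count, corpus_size):
    """Occurrences per million tokens for a raw count."""
    return count * 1_000_000 / corpus_size


def frequency_table(counts, corpus_size, min_per_million=0.0):
    """Return (type, count, per-million) rows, sorted by decreasing count.

    Rows at or below min_per_million are dropped; ties are broken
    alphabetically so the order is deterministic.
    """
    rows = [(w, c, per_million(c, corpus_size)) for w, c in counts.items()]
    rows = [row for row in rows if row[2] > min_per_million]
    rows.sort(key=lambda row: (-row[1], row[0]))
    return rows


# Rounded corpus sizes from the note above (assumed, not exact).
OLD_SIZE = 5_900_000_000   # non-words left out of the total
NEW_SIZE = 7_800_000_000   # non-words included

example_count = 59_000     # invented count for illustration
old_value = per_million(example_count, OLD_SIZE)
new_value = per_million(example_count, NEW_SIZE)
print(old_value, new_value)
```

Because the corrected corpus is larger, every corrected per-million 
value comes out smaller than the one in the original list.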

Thanks for your understanding,

Cyrus

=[=]={=}=[=]={=}=[=]={=}=[=]={=}=[=]={=}
Cyrus Shaoul
http://www.psych.ualberta.ca/~westburylab/
University of Alberta
780-492-5843
=[=]={=}=[=]={=}=[=]={=}=[=]={=}=[=]={=}


