[Corpora-List] UPDATE: Corrected Word frequencies for a large corpus of recent USENET text, and full list of types.
Cyrus Shaoul
cyrus.shaoul at ualberta.ca
Sat Sep 2 07:52:51 UTC 2006
Hello Again,
**
IMPORTANT: IF YOU DOWNLOADED THE ORIGINAL LIST, PLEASE GET THE CORRECTED
VERSION. SEE THE NOTE BELOW.
**
A "thank you" to all the folk who downloaded the first version of our
USENET word list. Some people made requests for a larger list of types,
not restricted to my original dictionary. I have now finished the list
of all types with frequency greater than 3 tokens/million tokens. It is
large (28 MB, compressed), with 5,609,086 types. Unfortunately, most of
the types in this list are URLs, e-mail addresses and other cruft that
are artifacts of my overly simplistic text processing (delete
punctuation, and split on whitespace).
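
For readers wondering why URLs and e-mail addresses end up as types, here
is a minimal Python sketch of "delete punctuation, and split on
whitespace". It is not the program used to build the list: the function
name count_types, the use of string.punctuation as the set of characters
to delete, and the sample sentence are all assumptions made for
illustration, and no case folding is applied.

```python
import string
from collections import Counter

# Assumption: "punctuation" means the ASCII set in string.punctuation.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def count_types(text):
    """Delete punctuation, split on whitespace, and count each type."""
    stripped = text.translate(_PUNCT_TABLE)
    return Counter(stripped.split())

# Hypothetical sample line: the URL and the address each collapse
# into a single long pseudo-word once their punctuation is removed.
sample = "See http://example.com/page or mail me at user@example.com."
counts = count_types(sample)
for word, n in counts.most_common():
    print(word, n)
```

Because the separators inside a URL or address are deleted rather than
replaced by spaces, each one survives as a single token, which is how
such cruft comes to dominate a list of all types.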
I know this list is not for everyone, but if you are interested in
seeing a lot of types, please download the file from here, and please
send me any feedback you have. I sorted the list by decreasing type
frequency.
http://www.psych.ualberta.ca/~westburylab/downloads/wlallfreq.download.html
WARNING: File size is 28 MB, compressed
**
NOTE: In doing this run, I noticed that my corpus grew in size from 5.9
to 7.8 billion words, despite the fact that I was using the same raw
data. I then discovered my bug: I forgot to count non-words in my
original program. So if you downloaded the original list of 111,627
words, the corpus size and freq/million numbers are WRONG! The counts
were correct, though. Please download the corrected list here (914k,
compressed):
http://www.psych.ualberta.ca/~westburylab/downloads/wlfreq.download.html
I also sorted this list by decreasing frequency for ease of use.
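
To show why the raw counts were unaffected while the freq/million
numbers were not, here is a small Python sketch of the bug as described
above. The function name freq_per_million, the toy counts and the
two-word dictionary are hypothetical, not taken from the real data or
the original program.

```python
from collections import Counter

def freq_per_million(counts, dictionary):
    """Return (corrected, buggy) freq/million tables for dictionary words.

    corrected: corpus size includes every token, non-words too.
    buggy:     corpus size includes dictionary words only.
    """
    total_all = sum(counts.values())
    total_dict = sum(n for w, n in counts.items() if w in dictionary)
    corrected = {w: counts[w] * 1_000_000 / total_all
                 for w in dictionary if w in counts}
    buggy = {w: counts[w] * 1_000_000 / total_dict
             for w in dictionary if w in counts}
    return corrected, buggy

# Toy corpus of 10 tokens, 2 of which are a non-word (hypothetical data).
counts = Counter({"the": 6, "corpus": 2, "httpexamplecom": 2})
dictionary = {"the", "corpus"}
corrected, buggy = freq_per_million(counts, dictionary)
print(corrected["the"], buggy["the"])
```

In this toy case the count for "the" is 6 either way, but leaving the
non-word out of the corpus size shrinks the denominator from 10 to 8 and
inflates the freq/million value.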
Thanks for your understanding,
Cyrus
=[=]={=}=[=]={=}=[=]={=}=[=]={=}=[=]={=}
Cyrus Shaoul
http://www.psych.ualberta.ca/~westburylab/
University of Alberta
780-492-5843
=[=]={=}=[=]={=}=[=]={=}=[=]={=}=[=]={=}