[Corpora-List] UPDATE: Corrected Word frequencies for a large corpus of recent USENET text, and full list of types.

Adam Kilgarriff adam at lexmasterclass.com
Sat Sep 2 21:40:09 UTC 2006


Just a comment about this kind of resource: wouldn't it be better to make it
available as a searchable resource, allowing people to specify the searches
they wanted and check up on anomalous frequencies, rather than distributing
a frequency list? A list will inevitably raise many questions for anyone
planning to use it seriously, which they won't be able to answer (at least
not without coming back to you, and their questions won't be your priority).

Adam

-----Original Message-----
From: owner-corpora at lists.uib.no [mailto:owner-corpora at lists.uib.no] On
Behalf Of Cyrus Shaoul
Sent: 02 September 2006 08:53
To: corpora at uib.no
Subject: [Corpora-List] UPDATE: Corrected Word frequencies for a large
corpus of recent USENET text, and full list of types.

Hello Again,

**
IMPORTANT: IF YOU DOWNLOADED THE ORIGINAL LIST, PLEASE GET THE CORRECTED 
VERSION. SEE THE NOTE BELOW.
**

  A "thank you" to all the folk who downloaded the first version of our
USENET word list. Some people asked for a larger list of types, not
restricted to my original dictionary. I have now finished the list of
all types with frequency greater than 3 tokens/million tokens. It is
large (28 MB, compressed), with 5,609,086 types. Unfortunately, most of
the types in this list are URLs, e-mail addresses and other cruft that
are artifacts of my overly simplistic text processing (delete
punctuation and split on whitespace).
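
For anyone wondering why that processing produces so much cruft, here is
a minimal Python sketch of "delete punctuation and split on whitespace".
It is not the program used to build the list. The function names, the
sample lines, and the choice of ASCII punctuation with no case folding
are all assumptions made for illustration.

```python
import string
from collections import Counter

# Translation table that deletes every ASCII punctuation character.
# Assumption: the original processing may have used a different set.
_DELETE_PUNCT = str.maketrans("", "", string.punctuation)


def naive_tokens(text):
    """Delete punctuation, then split on whitespace (no case folding)."""
    return text.translate(_DELETE_PUNCT).split()


def count_types(lines):
    """Count how often each type occurs across an iterable of lines."""
    counts = Counter()
    for line in lines:
        counts.update(naive_tokens(line))
    return counts


# Hypothetical sample lines, not taken from the USENET corpus.
sample = [
    "See http://www.example.com/page.html for details.",
    "Mail me at someone@example.org, please.",
]

counts = count_types(sample)
for word, n in counts.most_common():
    print(word, n)
```

With punctuation deleted, the URL and the e-mail address each collapse
into a single long "word", which is the kind of type that fills up the
full list.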

I know this list is not for everyone, but if you are interested in 
seeing a lot of types, please download the file from here, and please 
send me any feedback you have. I sorted the list by decreasing type 
frequency.
 
http://www.psych.ualberta.ca/~westburylab/downloads/wlallfreq.download.html

WARNING: File size is 28 MB, compressed

**
NOTE: In doing this run, I noticed that my corpus grew in size from 5.9 
to 7.8 billion words, despite the fact that I was using the same raw 
data. I then discovered my bug: I forgot to count non-words in my 
original program. So if you downloaded the original list of 111,627 
words, the corpus size and freq/million numbers are WRONG! The counts 
were correct, though. Please download the corrected list here (914k, 
compressed):

http://www.psych.ualberta.ca/~westburylab/downloads/wlfreq.download.html

I also sorted this list by decreasing frequency for ease of use.
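
To show how the bug affects the numbers, here is a small Python sketch.
It assumes my reading of the note above: the raw counts were right, but
the freq/million values were divided by a total that left out non-words
(about 5.9 billion) instead of the full total (about 7.8 billion). Both
totals are the rounded figures from this message, and the example count
is hypothetical.

```python
def per_million(count, corpus_tokens):
    """Frequency per million tokens for a raw count."""
    return count / corpus_tokens * 1_000_000


# Rounded totals quoted in the message above (approximate).
WORDS_ONLY_TOTAL = 5.9e9   # total that left out non-words (the bug)
ALL_TOKENS_TOTAL = 7.8e9   # total including non-words (corrected)

# Hypothetical raw count for some word; not a value from the list.
example_count = 1_000_000

wrong = per_million(example_count, WORDS_ONLY_TOTAL)
right = per_million(example_count, ALL_TOKENS_TOTAL)

# Multiplying an old freq/million value by this factor gives an
# approximate corrected value, since the raw counts did not change.
scale = WORDS_ONLY_TOTAL / ALL_TOKENS_TOTAL

print(f"old: {wrong:.2f}  corrected: {right:.2f}  scale: {scale:.3f}")
```

The scale factor comes out at about 0.76, so the old freq/million
values were too high by roughly a third. Because the totals here are
rounded, downloading the corrected list is still the safer option.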

Thanks for your understanding,

Cyrus

=[=]={=}=[=]={=}=[=]={=}=[=]={=}=[=]={=}
Cyrus Shaoul
http://www.psych.ualberta.ca/~westburylab/
University of Alberta
780-492-5843
=[=]={=}=[=]={=}=[=]={=}=[=]={=}=[=]={=}


