[Corpora-List] Word frequencies for a large corpus of recent USENET text

Thu Aug 31 19:14:33 UTC 2006

Hi Cyrus

a) Is the list in any particular order?

>Number of words: 5894564637
>WORD    COUNT   FREQPERMILLION
>BESTING 712     0.120789242946086
>PRACTICABLY     98      0.0166254856863995
>BANTERERS       2       0.00033929562625305
>RECLOTHE        89      0.0150986553682607
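For reference, the FREQPERMILLION column in the rows quoted above appears to be simply COUNT divided by the stated number of words, times one million. A minimal sketch that checks this and tests the four sample rows for two obvious orderings (the row data is copied from the quote; the function and variable names are illustrative only, not part of the released list):

```python
# Rows copied from the quoted sample: (word, count, freq_per_million).
TOTAL_WORDS = 5894564637
ROWS = [
    ("BESTING", 712, 0.120789242946086),
    ("PRACTICABLY", 98, 0.0166254856863995),
    ("BANTERERS", 2, 0.00033929562625305),
    ("RECLOTHE", 89, 0.0150986553682607),
]

def per_million(count, total=TOTAL_WORDS):
    """Occurrences per million running words."""
    return count / total * 1_000_000

# Does each listed frequency match count / total * 1e6?
freq_matches = all(abs(per_million(c) - f) < 1e-9 for _, c, f in ROWS)

# Is the sample in alphabetical order, or in descending order of count?
words = [w for w, _, _ in ROWS]
counts = [c for _, c, _ in ROWS]
alphabetical = words == sorted(words)
by_count_desc = counts == sorted(counts, reverse=True)

print(freq_matches, alphabetical, by_count_desc)
```

On these four rows the frequencies match the counts, and the sample follows neither ordering, which is what prompts question a).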

b) Why are some items given a score of 0?

>CYCLIZES        0       0

>PROCEEDERS      0       0

>DATEDLY 0       0
>TUTOYERED       0       0

c) Does this mean that this is not a corpus frequency list, but a
pre-existing wordlist with corpus frequencies attached?

d) If so, where did the original list come from? Is it a list used for
psycholinguistic recognition of 'real words' and 'pseudo-words' or
something like that?

e) You mention 111,627 English words; another indication that this is
not the entire corpus frequency list, nor the 'most frequent 111,627
types in the corpus' (as some have a frequency of 0).

f) If the corpus size is 5,894,564,637 tokens, the entire list cannot
contain only 111,627 types. The Bank of English corpus in 1993 contained
120,362,928 tokens and 475,633 types; in 2000, it contained 418,449,873
tokens and 938,914 types. So a corpus of 5,894,564,637 tokens must
contain a much larger number of types?
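
To put a rough number on point f), here is a back-of-envelope sketch that fits a power law of the form V = K * N^beta (Heaps' law, which is my assumption here and not something claimed in the announcement) through the two Bank of English data points and extrapolates to the USENET corpus size. USENET text differs from the Bank of English in genre, spelling and noise, so this is only an order-of-magnitude guide; all names in the code are illustrative.

```python
import math

# Bank of English figures quoted above: (tokens, types).
BOE_1993 = (120_362_928, 475_633)
BOE_2000 = (418_449_873, 938_914)
USENET_TOKENS = 5_894_564_637

def heaps_fit(point_a, point_b):
    """Fit V = K * N**beta exactly through two (tokens, types) points."""
    (n1, v1), (n2, v2) = point_a, point_b
    beta = math.log(v2 / v1) / math.log(n2 / n1)
    k = v1 / n1 ** beta
    return k, beta

def predict_types(k, beta, tokens):
    """Estimated number of types for a corpus of the given token count."""
    return k * tokens ** beta

k, beta = heaps_fit(BOE_1993, BOE_2000)
estimate = predict_types(k, beta, USENET_TOKENS)

print(f"beta = {beta:.3f}")
print(f"estimated types at {USENET_TOKENS:,} tokens: {estimate:,.0f}")
```

Under these assumptions the estimate comes out on the order of a few million types, far more than the 111,627 entries in the released list.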

Best
Ramesh

At 17:46 31/08/2006, you wrote:
>Hi All,
>I thought that this might be of interest to the list. I have also
>experimented with using a CC Attribution-NonCommercial-NoDerivs
>license for this word frequency list. Please tell me if you think
>this is a good or a bad idea.
>
>Thanks,
>Cyrus
>
>
>*******
>Announcement: Word frequencies for a large corpus of USENET text released.
>*******
>The Westbury Lab at the University of Alberta does research on lexical
>semantics and other areas of psycholinguistics. Recently, as part of a
>research program investigating high-dimensional models of semantic
>memory, they collected 5,894,564,637 words from 47,860 English
>language, non-binary-file newsgroups from the USENET between October
>2005 and August 2006. This list of orthographic frequencies for
>111,627 English words will be of use to anyone who has used older
>lists based on corpora from decades past.
>The list is available for download (3.3 MB file) under a Creative
>Commons 2.5 license at:
>     http://www.psych.ualberta.ca/~westburylab/downloads/wlfreq.download.html
>
>
>=[=]={=}=[=]={=}=[=]={=}=[=]={=}=[=]={=}
>Cyrus Shaoul
>http://www.psych.ualberta.ca/~westburylab/
>University of Alberta
>780-492-5843
>=[=]={=}=[=]={=}=[=]={=}=[=]={=}=[=]={=}

Ramesh Krishnamurthy

Lecturer in English Studies, School of Languages and Social Sciences,
Aston University, Birmingham B4 7ET, UK
[Room NX08, North Wing of Main Building]
Tel: +44 (0)121-204-3812 ; Fax: +44 (0)121-204-3766
http://www.aston.ac.uk/lss/staff/krishnamurthyr.jsp

Project Leader, ACORN (Aston Corpus Network): http://corpus.aston.ac.uk/ 