<html>
<body>
Hi Cyrus<br><br>
a) Is the list in any particular order?<br><br>
<blockquote type=cite class=cite cite="">
<font face="Courier New, Courier">Number of words: 5894564637<br>
WORD<x-tab> </x-tab>COUNT<x-tab>
</x-tab>FREQPERMILLION<br>
BESTING<x-tab> </x-tab>712<x-tab>
</x-tab>0.120789242946086<br>
PRACTICABLY<x-tab> </x-tab>98<x-tab>
</x-tab>0.0166254856863995<br>
BANTERERS<x-tab> </x-tab>2<x-tab>
</x-tab>0.00033929562625305<br>
RECLOTHE<x-tab> </x-tab>89<x-tab>
</x-tab>0.0150986553682607</font></blockquote><br>
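As a cross-check on the rows quoted above: the FREQPERMILLION column appears to be COUNT divided by the "Number of words" total, scaled to one million. A minimal Python sketch, assuming that is how the column was derived (the function name is illustrative and not taken from the Westbury Lab's own tools):<br>
<pre>
```python
# Total from the "Number of words" header line of the list.
TOTAL_WORDS = 5894564637


def freq_per_million(count, total=TOTAL_WORDS):
    """Occurrences per million words, assuming COUNT / total * 1,000,000."""
    return count / total * 1_000_000


# COUNT values copied from the rows quoted above.
rows = {"BESTING": 712, "PRACTICABLY": 98, "BANTERERS": 2, "RECLOTHE": 89}

for word, count in rows.items():
    print(word, count, freq_per_million(count), sep="\t")
```
</pre>
Under this assumption, a COUNT of 0 necessarily gives a FREQPERMILLION of 0.<br><br>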
b) Why do some items have a count (and frequency) of 0?<br><br>
<blockquote type=cite class=cite cite="">
<font face="Courier New, Courier">CYCLIZES<x-tab>
</x-tab>0<x-tab>
</x-tab>0</font></blockquote>
<br>
<blockquote type=cite class=cite cite="">
<font face="Courier New, Courier">PROCEEDERS<x-tab>
</x-tab>0<x-tab>
</x-tab>0</font></blockquote>
<br>
<blockquote type=cite class=cite cite="">
<font face="Courier New, Courier">DATEDLY<x-tab> </x-tab>0<x-tab>
</x-tab>0<br>
TUTOYERED<x-tab> </x-tab>0<x-tab>
</x-tab>0</font></blockquote>
<br>
c) Does this mean that this is not a corpus frequency list as such, but a
pre-existing wordlist<br>
with corpus frequencies attached?<br><br>
d) If so, where did the original list come from? Is it a list used for
psycholinguistic recognition<br>
of 'real words' and 'pseudo-words' or something like that?<br><br>
e) You mention 111,627 English words; another indication that this is not
the entire corpus frequency list, <br>
nor the 'most frequent 111,627 types in the corpus' (as some have a
frequency of 0).<br><br>
f) If the corpus size is 5,894,564,637 tokens, a complete frequency list
for it cannot contain only 111,627 types.<br>
The Bank of English corpus in 1993 contained 120,362,928 tokens, and
475,633 types;<br>
in 2000, it contained 418,449,873 tokens and 938,914 types. So a corpus
of 5,894,564,637 tokens<br>
must contain a much larger number of types?<br><br>
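To put a rough number on this, one option is to assume that vocabulary grows with corpus size according to Heaps' law, V = K * N^beta. That is an assumption brought in here purely for illustration, not something stated in the announcement. The sketch below fits beta to the two Bank of English data points and extrapolates to the USENET token count. The function names are hypothetical, and USENET text may well behave differently from the Bank of English, so the result is at best an order-of-magnitude guess:<br>
<pre>
```python
import math

# Bank of English figures quoted above, as (tokens, types).
BOE_1993 = (120_362_928, 475_633)
BOE_2000 = (418_449_873, 938_914)

# Corpus size stated in the announcement.
USENET_TOKENS = 5_894_564_637


def heaps_exponent(p1, p2):
    """Exponent beta of V = K * N**beta through two (tokens, types) points."""
    (n1, v1), (n2, v2) = p1, p2
    return math.log(v2 / v1) / math.log(n2 / n1)


def extrapolate_types(tokens, ref, beta):
    """Scale a reference vocabulary size to a new token count (assumes Heaps' law)."""
    n_ref, v_ref = ref
    return v_ref * (tokens / n_ref) ** beta


beta = heaps_exponent(BOE_1993, BOE_2000)
estimated_types = extrapolate_types(USENET_TOKENS, BOE_2000, beta)

print(f"beta = {beta:.3f}")
print(f"estimated types = {estimated_types:,.0f}")
```
</pre>
Whatever the exact figure, any estimate of this kind lies far above 111,627 types.<br><br>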
Best<br>
Ramesh<br><br>
At 17:46 31/08/2006, you wrote:<br>
<blockquote type=cite class=cite cite="">Hi All, <br>
I thought that this might be of interest to the list. I have also
experimented with using a CC Attribution-NonCommercial-NoDerivs license
for this word frequency list. Please tell me if you think this is a good
or a bad idea.<br><br>
Thanks, <br>
Cyrus<br><br>
<br>
*******<br>
Announcement: Word frequencies for a large corpus of USENET text
released.<br>
*******<br>
The Westbury Lab at the University of Alberta does research on lexical
semantics and other areas of psycholinguistics. Recently, as part of a
research program investigating high-dimensional models of semantic
memory, they collected 5,894,564,637 words from 47,860 English-language,
non-binary-file newsgroups from the USENET between October 2005 and
August 2006. This list of orthographic frequencies for 111,627 English
words will be of use to anyone who has used older lists based on corpora
from decades past.<br>
The list is available for download (3.3 MB file) under a Creative
Commons 2.5 license at:<br>
<a href="http://www.psych.ualberta.ca/~westburylab/downloads/wlfreq.download.html" eudora="autourl">
http://www.psych.ualberta.ca/~westburylab/downloads/wlfreq.download.html</a>
<br>
<br><br>
=[=]={=}=[=]={=}=[=]={=}=[=]={=}=[=]={=}<br>
Cyrus Shaoul<br>
<a href="http://www.psych.ualberta.ca/~westburylab/" eudora="autourl">
http://www.psych.ualberta.ca/~westburylab/</a><br>
University of Alberta<br>
780-492-5843<br>
=[=]={=}=[=]={=}=[=]={=}=[=]={=}=[=]={=}<br><br>
<br><br>
<br>
</blockquote>
<x-sigsep><p></x-sigsep>
Ramesh Krishnamurthy<br><br>
Lecturer in English Studies, School of Languages and Social Sciences,
Aston University, Birmingham B4 7ET, UK<br>
[Room NX08, North Wing of Main Building] ; Tel: +44 (0)121-204-3812 ;
Fax: +44 (0)121-204-3766<br>
<a href="http://www.aston.ac.uk/lss/staff/krishnamurthyr.jsp" eudora="autourl">
http://www.aston.ac.uk/lss/staff/krishnamurthyr.jsp</a><br><br>
Project Leader, ACORN (Aston Corpus Network):
<a href="http://corpus.aston.ac.uk/" eudora="autourl">
http://corpus.aston.ac.uk/</a></body>
</html>