[Corpora-List] Word frequencies for a large corpus of recent USENET text

Cyrus Shaoul cyrus.shaoul at ualberta.ca
Thu Aug 31 20:41:15 UTC 2006


Ramesh Krishnamurthy wrote:
> Hi Cyrus
>
Hi Ramesh.
> a) Is the list in any particular order?
>
No, it is not in any particular order. Feel free to sort it as you wish.
> b) Why are some items given a score of 0?
>
Because they never occurred in the corpus.

> c) This means that this cannot be a corpus frequency list, but a
> pre-existing wordlist with corpus frequencies attached?
>
Sorry, I should have clarified this in my first e-mail. You are correct:
I used a pre-existing word list that I had made for other purposes and
attached corpus frequencies to it. It does not contain all the words
that occurred in the corpus.
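
To illustrate how a pre-existing word list ends up with scores of 0, here is a minimal sketch. It is not the script used for the USENET corpus; the function name, the toy data, and the whitespace tokenization are assumptions made purely for illustration.

```python
from collections import Counter

def wordlist_frequencies(tokens, wordlist):
    """Return corpus counts for each word in a fixed word list.

    Words that never occur in the corpus get a count of 0. Corpus
    types that are missing from the word list are not reported.
    """
    counts = Counter(tokens)
    # Counter returns 0 for keys it has never seen.
    return {word: counts[word] for word in wordlist}

# Toy data, purely illustrative.
corpus_tokens = "the cat sat on the mat".split()
freqs = wordlist_frequencies(corpus_tokens, ["the", "cat", "dog"])
print(freqs)
```

In this toy case "dog" receives a count of 0, while "sat", "on" and "mat" occur in the corpus but never appear in the output, which is the situation described in (c).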
> d) If so, where did the original list come from? Is it a list used for
> psycholinguistic recognition of 'real words' and 'pseudo-words' or
> something like that?
>
Yes, that is the type of list it is. It came from various other word 
lists used in psycholinguistics. If you do find any non-words in the 
list, please let me know.
Also, if you can clarify what you mean by pseudo-words, I would 
appreciate it! :-)
> e) You mention 111,627 English words; another indication that this is
> not the entire corpus frequency list, nor the 'most frequent 111,627
> types in the corpus' (as some have a frequency of 0).
Indeed, this is correct.
>
> f) If the corpus size is 5,894,564,637 tokens, the entire list cannot
> contain only 111,627 types. The Bank of English corpus in 1993
> contained 120,362,928 tokens and 475,633 types; in 2000, it contained
> 418,449,873 tokens and 938,914 types. So a corpus of 5,894,564,637
> tokens must contain a much larger number of types?
There are definitely more than 111,627 types in this corpus.
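
As a rough sketch of the reasoning in (f), the snippet below counts types and tokens for a token sequence and compares the two Bank of English figures quoted above. The function name and toy data are hypothetical, and no attempt is made to predict the number of types in the USENET corpus.

```python
from collections import Counter

def type_token_counts(tokens):
    """Return (number of types, number of tokens) for a token sequence."""
    counts = Counter(tokens)
    return len(counts), sum(counts.values())

# Toy data, purely illustrative.
n_types, n_tokens = type_token_counts("the cat sat on the mat".split())

# Bank of English figures as quoted in the message above.
boe_1993 = {"tokens": 120_362_928, "types": 475_633}
boe_2000 = {"tokens": 418_449_873, "types": 938_914}

token_growth = boe_2000["tokens"] / boe_1993["tokens"]
type_growth = boe_2000["types"] / boe_1993["types"]
print(f"tokens grew {token_growth:.2f}x, types grew {type_growth:.2f}x")
```

In those figures the type count grows more slowly than the token count but keeps growing, which is consistent with a much larger corpus containing many more than 111,627 types.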

I have a feeling that some CORPORA folk would like to see all the types
and their frequencies. If you are interested in this data, please
contact me directly. I would like to know how many people are interested
before I try to extract all the types in the corpus.

Also, I am working on the list of word bigrams and trigrams, and hope to 
release those one day soon as well. (Any interest from
list members? Write me.)
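
For readers unfamiliar with the terms, word bigrams and trigrams are contiguous sequences of two and three tokens. A minimal sketch of counting them follows; the function name and toy data are assumptions, and this is not necessarily how the lists mentioned above are being built.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count contiguous n-grams (as tuples) in a list of tokens."""
    # zip over n shifted views of the list; zip stops at the shortest one.
    return Counter(zip(*(tokens[i:] for i in range(n))))

# Toy data, purely illustrative.
tokens = "the cat sat on the cat".split()
bigrams = ngram_counts(tokens, 2)
trigrams = ngram_counts(tokens, 3)
print(bigrams.most_common(1))
```

This version does not respect sentence or message boundaries; a real corpus would probably need n-grams restricted to within such units.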

Yours,

Cyrus

=[=]={=}=[=]={=}=[=]={=}=[=]={=}=[=]={=}
Cyrus Shaoul
http://www.psych.ualberta.ca/~westburylab/
University of Alberta
780-492-5843
=[=]={=}=[=]={=}=[=]={=}=[=]={=}=[=]={=}



