[Corpora-List] UPDATE: Corrected Word frequencies for a large corpus of recent USENET text, and full list of types.
Adam Kilgarriff
adam at lexmasterclass.com
Sat Sep 2 21:40:09 UTC 2006
Just a comment about this kind of resource: wouldn't it be better to make it
available as a searchable resource, letting people specify the searches
they want and check up on anomalous frequencies? A distributed frequency
list will inevitably raise many questions for anyone planning to use it
seriously, and they won't be able to answer them (at least not without
coming back to you, and their questions won't be your priority).
Adam
-----Original Message-----
From: owner-corpora at lists.uib.no [mailto:owner-corpora at lists.uib.no] On
Behalf Of Cyrus Shaoul
Sent: 02 September 2006 08:53
To: corpora at uib.no
Subject: [Corpora-List] UPDATE: Corrected Word frequencies for a large
corpus of recent USENET text, and full list of types.
Hello Again,
**
IMPORTANT: IF YOU DOWNLOADED THE ORIGINAL LIST, PLEASE GET THE CORRECTED
VERSION. SEE THE NOTE BELOW.
**
A "thank you" to all the folk who downloaded the first version of our
USENET word list. Some people made requests for a larger list of types,
not restricted to my original dictionary. I have now finished the list
of all types with frequency greater than 3 tokens/million tokens. It is
large (28 MB, compressed), with 5,609,086 types. Unfortunately, most of
the types in this list are URLs, e-mail addresses and other cruft that
are artifacts of my overly simplistic text processing (delete
punctuation and split on whitespace).
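
For readers wondering why URLs and e-mail addresses end up as types, here is a minimal sketch of "delete punctuation and split on whitespace" in Python. It is not the program used to build the list: the function names, the use of ASCII punctuation only, and the absence of case folding are all assumptions made for illustration.

```python
import string
from collections import Counter

# Assumption: "punctuation" means the ASCII set in string.punctuation.
_DELETE_PUNCT = str.maketrans("", "", string.punctuation)

def tokenize(text):
    """Delete punctuation characters, then split on whitespace."""
    return text.translate(_DELETE_PUNCT).split()

def type_frequencies(text):
    """Return (type, count) pairs sorted by decreasing count.

    Ties are broken alphabetically so the order is deterministic.
    """
    counts = Counter(tokenize(text))
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

if __name__ == "__main__":
    # Hypothetical sample: the URL and the address each collapse
    # into a single run-together token once punctuation is removed.
    sample = "See http://example.com/a-b now. Mail user@example.com now"
    for word, count in type_frequencies(sample):
        print(count, word)
```

With this approach a URL such as http://example.com/a-b becomes the single type "httpexamplecomab", which is the kind of cruft described above.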
I know this list is not for everyone, but if you are interested in
seeing a lot of types, please download the file from here, and please
send me any feedback you have. I sorted the list by decreasing type
frequency.
http://www.psych.ualberta.ca/~westburylab/downloads/wlallfreq.download.html
WARNING: File size is 28 MB, compressed
**
NOTE: In doing this run, I noticed that my corpus grew in size from 5.9
to 7.8 billion words, despite the fact that I was using the same raw
data. I then discovered my bug: I forgot to count non-words in my
original program. So if you downloaded the original list of 111,627
words, the corpus size and freq/million numbers are WRONG! The counts
were correct, though. Please download the corrected list here (914k,
compressed):
http://www.psych.ualberta.ca/~westburylab/downloads/wlfreq.download.html
I also sorted this list by decreasing frequency for ease of use.
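
To show why the raw counts were unaffected while the freq/million figures changed, here is a small Python sketch. It assumes the per-million figure is simply count * 1,000,000 / corpus size, uses the rounded corpus sizes quoted above, and takes a made-up count for one word; none of this is the original program.

```python
def per_million(count, corpus_size):
    """Occurrences per million tokens, given a raw count."""
    return count * 1_000_000 / corpus_size

# Rounded corpus sizes from the note above.
TOKENS_WORDS_ONLY = 5.9e9  # original run: non-words were not counted
TOKENS_ALL = 7.8e9         # corrected run: every token counted

# Hypothetical raw count for one word; it is the same in both runs.
count = 1_000_000

old_rate = per_million(count, TOKENS_WORDS_ONLY)
new_rate = per_million(count, TOKENS_ALL)

# The correction factor depends only on the two corpus sizes,
# not on the word being looked up.
scale = new_rate / old_rate

if __name__ == "__main__":
    print("old freq/million:", round(old_rate, 2))
    print("new freq/million:", round(new_rate, 2))
    print("scale factor:", round(scale, 3))
```

Under these assumptions every freq/million value in the original list was too high by the same factor, because the denominator left out the non-word tokens.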
Thanks for your understanding,
Cyrus
=[=]={=}=[=]={=}=[=]={=}=[=]={=}=[=]={=}
Cyrus Shaoul
http://www.psych.ualberta.ca/~westburylab/
University of Alberta
780-492-5843
=[=]={=}=[=]={=}=[=]={=}=[=]={=}=[=]={=}