7.583, Sum: Am. English word frequency lists

The Linguist List linguist at tam2000.tamu.edu
Fri Apr 19 13:25:40 UTC 1996


---------------------------------------------------------------------------
LINGUIST List:  Vol-7-583. Fri Apr 19 1996. ISSN: 1068-4875. Lines:  373
 
Subject: 7.583, Sum: Am. English word frequency lists
 
Moderators: Anthony Rodrigues Aristar: Texas A&M U. <aristar at tam2000.tamu.edu>
            Helen Dry: Eastern Michigan U. <hdry at emunix.emich.edu> (On Leave)
            T. Daniel Seely: Eastern Michigan U. <dseely at emunix.emich.edu>
 
Associate Editor:  Ljuba Veselinova <lveselin at emunix.emich.edu>
Assistant Editors: Ron Reck <rreck at emunix.emich.edu>
                   Ann Dizdar <dizdar at tam2000.tamu.edu>
                   Annemarie Valdez <avaldez at emunix.emich.edu>
 
Software development: John H. Remmers <remmers at emunix.emich.edu>
 
Editor for this issue: dseely at emunix.emich.edu (T. Daniel Seely)
 
---------------------------------Directory-----------------------------------
1)
Date:  Fri, 19 Apr 1996 08:19:19 EDT
From:  LBHNDP at ritvax.isc.rit.edu ("L. HILLMAN")
Subject:  Sum: Am.  English word frequency lists
 
---------------------------------Messages------------------------------------
1)
Date:  Fri, 19 Apr 1996 08:19:19 EDT
From:  LBHNDP at ritvax.isc.rit.edu ("L. HILLMAN")
Subject:  Sum: Am.  English word frequency lists
 
In a recent request to LINGUIST, I asked for word frequency
lists for American English.  I am grateful to all of you
for your help.
 
Lou Hillman   lbhndp at rit.edu
 
 
In addition to the quoted responses, the following people
also suggested:
 
     Frequency Analysis of English Usage, Lexicon and Grammar
     by W. Nelson Francis and Henry Kucera
 
in its various guises.
 
        MARC PICARD <PICARD at vax2.concordia.ca>
        Guillaume Gantard <ggantard at logos-usa.com>
        Judith Parker <jparker at s850.mwc.edu>
 
Here are excerpts from the other responses.
- ---------------------------------------------------------------
From: patrick.juola at psy.ox.ac.uk (Patrick Juola)
 
There are several professionally compiled lists of several million
words, sorted by frequency in various corpora -- I know that the
Brown corpus (Kucera & Francis, a zillion years ago) is available
on-line from UPenn if you know whom to ask.
 
*BUT* having answered your question, please please please please
please let me warn you away from trusting any of the answers you
receive -- as a professional corpus linguist, I can tell you that you
will run into some *serious* sampling effects in any corpus of that
size.  A rough
check on the Brown histogram reveals that the 20,000th word is
"bombproof", with a frequency of three per million text tokens.  The
inclusion or exclusion of a single page of text in the Brown corpus
would be enough to add or remove a word from the list (as a rough
test, I just opened a copy of a book and confirmed that both the
words "cliques" [rank 41505 in the Brown corpus] and "subgraphs"
[did not appear] occurred three times on that page.)
 
The implications are fairly obvious -- the lists that you get are
very sensitive to the corpora from which they are drawn, and
particularly to the style, language, and content of the corpora --
so a list compiled from six million words of newspaper articles is
likely to be significantly and substantially different from a list
compiled from six million words of USENET postings, which in turn
will be completely different from six million words of magazines,
&c.
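 
[One quick way to see this sensitivity for yourself, once you have two
frequency lists in the "count word" format produced by a pipeline such
as the one John Beaven gives below, is to compare their vocabularies
directly.  This is only a rough sketch, and the file names are
placeholders.]
 
# list the words that occur in only one of two frequency lists
# (column 1: only in the first corpus; column 2: only in the second)
awk '{print $2}' newspapers.freq | sort > /tmp/vocab1.$$
awk '{print $2}' usenet.freq     | sort > /tmp/vocab2.$$
comm -3 /tmp/vocab1.$$ /tmp/vocab2.$$
rm -f /tmp/vocab1.$$ /tmp/vocab2.$$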
 
- ---------------------------------------------------------------
From: john.beaven at sharp.co.uk (John Beaven)
 
Not exactly an answer to your question, but as a last resort you
could always "roll your own" by running this Unix script on your
favourite multi-million word corpus...
 
#! /bin/sh -
# finds the word frequencies in a text and sorts them in decreasing
# order (i.e. most frequent word at top)
# the leading space keeps deroff from treating lines that begin with
# "." or "'" as troff requests; deroff -w then emits one word per line
awk '{print " " $0}' "$1" | deroff -w | sort | uniq -c | sort -nr
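 
[If the corpus is plain text rather than troff input, deroff may not be
available or needed; a rough variant of the same idea using only tr is
sketched below.  It folds everything to lower case, and its letters-only
tokenisation will split forms such as "don't" in two.]
 
#! /bin/sh -
# word frequencies for a plain-text corpus, most frequent word at top
tr 'A-Z' 'a-z' < "$1" | tr -cs 'a-z' '\n' | sed '/^$/d' | sort | uniq -c | sort -nr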
 
- ---------------------------------------------------------------
From: Evan.Antworth at SIL.ORG (Evan L. Antworth)
 
Go to this address:
 
    gopher://gopher.sil.org/11/gopher_root/linguistics/info/
 
and look at the items titled "English word frequencies...". (I didn't
create these lists; I just got them from an FTP site at Vassar.)
 
[see below for FTP address.  LBH]
 
- ---------------------------------------------------------------
From: cball at guvax.acc.georgetown.edu (Catherine N. Ball)
 
I think you can find many frequency lists for American English in
the library -- for example, Francis and Kucera published one based
on their (now famous) corpus of American English known as the 'Brown
Corpus' (which is available from the Oxford Text Archive).  You can
also make your own frequency list using simple software.  I recently
made a 'Web Frequency Indexer' which allows you to paste in your
text and get a frequency list -- I will be modifying it soon to
allow the user to simply give the name of a file on their own
computer.  Anyhow, you might find it useful.  The URL is
        http://www.georgetown.edu/cball/webtools/web_freqs.html
 
- ---------------------------------------------------------------------
From: meador at U.Arizona.EDU (Diane L Meador)
 
I have available, through my web page at the URL below, an American
English lexical database, "Phondic".  It's packaged with "Sample", a
program written by Emmanuel Dupoux (CNRS, Paris), which searches the
database by several criteria, such as stress and syllable patterns,
phonemic or orthographic strings, etc.  One of the options is
frequency.  While I have never tried to sort by frequency, I don't
imagine that it would be difficult to do so.
 
I hope that this meets your needs.  If you do decide to use it, I
ask on behalf of Emmanuel Dupoux that he be given acknowledgment
credit.  The program has DOS and Unix versions.  Follow the
"Available Papers" link on my page; it's listed under "Miscellany".
 
        http://aruba.ccit.arizona.edu/~meador
 
- ---------------------------------------------------------------------
From: ms2928 at liverpool.ac.uk (Mike Scott)
 
Do you have a particular corpus in mind? The kind of 40,000-word
list you get will be pretty dependent on the corpus you use.
 
For example, I have done a word list on the UK newspaper the
Guardian, and without lemmatising, 4 million tokens will give rise
to about 85,000 word types. 10 million might give about 120,000 and
100 million gives about 250,000.
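 
[The corresponding figures for a corpus of your own can be read off a
frequency list in the "count word" format produced by a pipeline like
the one quoted above from John Beaven: the number of lines is the
number of types, and the sum of the counts is the number of tokens.  A
one-line sketch, with guardian.freq standing in for whatever the list
file is called:]
 
awk '{ tokens += $1 } END { printf "types: %d   tokens: %d\n", NR, tokens }' guardian.freq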
 
I have produced a word lister (etc.), published by Oxford University
Press and available via
        http://www.liv.ac.uk/~ms2928/homepage.html
        http://www1.oup.co.uk/oup/elt/software/wsmith?
 
The software costs UK sterling 49 (about US$75) and does a lot more
than just word listing.  If you visit the OUP site you'll see sample
screens that show the idea.
 
Alternatively there are existing word lists in paper format:
McGraw Hill have one, there's Francis & Kucera, and presumably
the Brown corpus of the 60s will be in machine-readable format
too.
 
- ---------------------------------------------------------------------
[Vera Kempe posted a similar request several months ago and sent
the following message, which she forwarded to me.  Some information is
repeated from above; I have edited briefly.  LBH]
 
From: VKEMPE at UOFT02.UTOLEDO.EDU
 
For all those who have asked me to share the responses to my query
about computerized word frequency lists -- here is what I have got so far.
 
Good luck!
 
- Vera Kempe
Department of Psychology
University of Toledo
vkempe at uoft02.utoledo.edu
 
 
 
From:   IN%"PICARD at VAX2.CONCORDIA.CA"  "MARC PICARD"
To:     IN%"vkempe at uoft02.utoledo.edu"
CC:
Subj:   Frequency count
 
I don't have Francis & Kucera but I do have LOB and KWC.  Let me know if
you're interested and I'll send them along.
 
Marc Picard
____________________________________________________
 
From:   IN%"C.J.Gledhill at aston.ac.uk"
To:     IN%"vkempe at uoft02.utoledo.edu"
CC:
Subj:
 
Write to Birmingham University's Cobuild, a corpus-based
lexicographic project: direct at cobuild.collins.co.uk
 
 
 
Chris J Gledhill
Lecturer in French
Languages and European Studies
Aston University
BIRMINGHAM B4 7ET
c.j.gledhill at aston.ac.uk
____________________________________________________
 
From:   IN%"griffith at kula.usp.ac.fj"  "Patrick Griffiths"
 
Dear Dr Kempe
 
In the Journal of Child Language, 1994(2), 513-6, there is a review
by George Dunbar of Philip Quinlan's OXFORD PSYCHOLINGUISTIC DATABASE
(Oxford University Press, 1992).  This is a package of computer
software for the Macintosh.  The reviewer says (p. 513): "The database
contains entries for over 98,000 words, with information on up to 26
properties of each.  This includes information on physical
properties, such as the length of the word, other objective
properties, such as its frequency of occurrence in the
Kucera-Francis list, and subjective or 'psychological' properties,
such as imageability ratings."
 
A single user licence was priced at 205 British pounds, which
corresponds to somewhere between 300 and 400 US dollars, I think.
 
Best wishes
 
Patrick
____________________________________________________
 
From:   IN%"edwards at cogsci.Berkeley.EDU"
To:     IN%"vkempe at uoft02.utoledo.edu"
CC:     IN%"edwards at cogsci.Berkeley.EDU"
Subj:   word frequencies online
 
The site below has a couple of such lists, with documentation available there.
Hope this helps,
-Jane Edwards
- -------------------------------------------------------------------
 
From: veronis at vassar.edu (Jean Veronis)
 
As I pointed out in a previous message, the list of frequencies in the
Brown Corpus is available in the public domain, as part of the MRC
database.  However, to make things easier for those who are interested
only in these frequencies, I have just put up the list of the most
frequent words (more than 10 occurrences) in the Brown Corpus.  It
would be interesting to compare it with the list of the 5000 most
frequent words in the Wall Street Journal made available by Ken Church.
 
ftp             : vaxsar.vassar.edu or 143.226.1.6
user            : anonymous
password        : your name
subdirectory    : nlp
____________________________________________________
From:   IN%"jem at cobuild.collins.co.uk"  "Jem Clear"
To:     IN%"vkempe at uoft02.utoledo.edu"
CC:
Subj:   Word frequencies
 
We do indeed have word frequency lists drawn from our extensive corpora
of modern English.  Do you know about Cobuild?  (If not, have a
look at our WWW site at URL
  http://titania.cobuild.collins.co.uk/
for more information.)
 
Briefly, we have a 20-million word corpus accessible via a subscription
service called CobuildDirect.  These 20m samples are taken from our main
"Bank of English" corpus of 211m words (as at time of writing -- we keep
adding more to it).
 
We receive many requests like yours, so we have recently decided to make
some sort of standard tariff for providing frequency lists.  Here it is:
 
- --------------------------------------------
 
1. Complete lemmatised 20m freq list
  a.  (incl. infl forms, POS, freqs)            150
2. 10,000 most freq lemma from 20m
  a.  (lemmas + POS)                            100
  b.  (with freqs)                              120
  c.  (with infl forms + freqs)                 150
 
 
 
3. 10,000 most freq lemma from 211m
  a.  (lemmas + POS)                            500
  b.  (with freqs)                              600
  c.  (with infl forms + freqs)                 700
 
 
5.
  a.  1a. but only top 1,000 words               25
  b.  1a. but only top 2,000 words               30
  c.  1a. but only top 5,000 words               50
 
Note that "POS" means part-of-speech tags, and "lemmas" means that
inflected forms of nouns and verbs have been lemmatised to the base
form and their separate frequencies summed.
 
Here is a brief sample of list 1a.
 
last JJ 19548
no RB 19399
where WH 18542
find V 18420
  VB find 9001
  VB found 35
  VBD found 4294
  VBG finding 1103
  VBN found 3392
  VBZ finds 595
these DTG 18349
down IN 18014
tell V 17662
  VB tell 6400
  VBD told 5999
  VBG telling 1552
  VBN told 2769
  VBZ tells 942
even RB 17523
three CD 17346
should MD 16764
pound N 16564
  NN pound 954
  NNS pounds 14875
  NP pound 22
  NP pounds 713
off IN 16210
week N 16204
  NN week 11967
  NNS weeks 4228
  NP week 4
  NP weeks 5
really RB 16080
work V 16027
  VB work 6036
  VBD worked 1732
  VBG working 5690
  VBN worked 1299
  VBZ works 1270
may MD 15774
back RB 15759
yes UH 15742
life N 15624
  NN life 13432
  NNS lifes 11
  NNS lives 2122
  NP life 46
  NP lives 13
through IN 15614
those DTG 15473
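 
[In this format the unindented header lines carry the lemma, its
part-of-speech class and its total frequency, while the indented lines
give the tagged inflected forms; the header total is the sum of the
indented counts, e.g. for "find": 9001 + 35 + 4294 + 1103 + 3392 + 595
= 18420.  The short awk sketch below checks those sums for a list saved
in this format; the file name freqlist.txt is only a placeholder.]
 
# report any lemma whose header total differs from the sum of its
# inflected-form counts (header lines unindented, form lines indented)
awk '
  /^[^ ]/ && NF == 3 { if (n > 0 && sum != total)
                         printf "%s: header %d != sum %d\n", lemma, total, sum
                       lemma = $1; total = $3; sum = 0; n = 0; next }
  /^ /    && NF == 3 { sum += $3; n++ }
  END                { if (n > 0 && sum != total)
                         printf "%s: header %d != sum %d\n", lemma, total, sum }
' freqlist.txt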
 
 
Best wishes
____________________________________________________
From:   IN%"peereman at u-bourgogne.fr"
To:     IN%"vkempe at uoft02.utoledo.edu"
CC:
Subj:   RE: Francis&Kucera
 
You can try the MRC Psycholinguistic Database. You will find
information on the Web at http://web.inf.rl.ac.uk/proj/psych.html
 
Sincerely,
 
Ronald Peereman
 
- -----------------------------------------------------------
Ronald Peereman
Laboratoire d'Etudes des Apprentissages et du Developpement-
C.N.R.S., Universite de Bourgogne, Dijon, France
fax. (33)80395767, email: peereman at satie.u-bourgogne.fr
------------------------------------------------------------------------
LINGUIST List: Vol-7-583.


