Corpora: Summary: corpus frequencies for psycholinguistics experiments

Sat Jul 15 19:07:20 UTC 2000

This is a summary of replies to my question about alternatives to
Francis and Kucera for getting word frequency data to use in
psycholinguistics experiments.  The rationale behind the query is
nicely summed up by one of the respondents, who wrote: "[F&K] has been
shown (by Burgess I think) to be inaccurate for many words that it
calls low frequency and of course it's out of date by now, some words
that are now considered politically incorrect for example, and
therefore, not used very often, are relatively frequent in there."

Thanks to the following people for replies:

  Chris Brew <cbrew at ling.ohio-state.edu>
  Adam Kilgarriff <Adam.Kilgarriff at itri.brighton.ac.uk>
  Jim Magnuson <magnuson at ling.ling.rochester.edu>
  Paul Rayson <paul at comp.lancs.ac.uk>
  Nina Silverberg <nsilverb at astro.ocis.temple.edu>

Here's the summary.

1a. British National Corpus (http://info.ox.ac.uk/bnc/)

   The corpus itself is available only to Europeans, but Adam
   Kilgarriff has produced word frequency lists and put them on the
   Web at http://www.itri.brighton.ac.uk/~Adam.Kilgarriff/bnc-readme.html.
   He writes, "the lists from the BNC on my web page - particularly
   the lemmatised ones - were produced with English teaching and
   dictionaries in mind, and have been quite widely used for
   experiment-type purposes.  The BNC is clearly appropriate, as it
   was designed with 'general English' in mind.  (though it is
   British, but I suspect the differences there are quite marginal.)
   It's been getting 200 files downloaded per month for 4 years now,
   and I think it is quite widely used."

   Adam's paper

    @article{ak-ijl,
      author  = "Adam Kilgarriff",
      title   = "Putting Frequencies into the Dictionary",
      journal = "International Journal of Lexicography",
      year    = 1997,
      volume  = 10,
      number  = 2,
      pages   = {135--155}
    }

   argues for the list and explains how it was compiled; an on-line
   copy is available from his Web page.

   Paul Rayson has been working on the BNC and writes:

     I have been working on frequency lists for the second version of
     the BNC (POS tagging and file headers updated) and short versions
     of those lists will appear in

       Leech, G., Wilson, A., Rayson, P. (forthcoming). Word Frequencies
       in Spoken and Written English: based on the British National
       Corpus. Longman, London.

     Due to the size of the lists, we plan to make the longer versions
     available on the UCREL website later this year when the book is
     published.

     http://www.comp.lancs.ac.uk/ucrel/

1b. BNC Online (http://sara.natcorp.ox.ac.uk/)

   Although the corpus itself is not available outside Europe, there
   is worldwide access to search capabilities, so you can search
   for instances of particular words, phrases, or patterns.  I've tried
   it and it's quite nice.

2. For the future: the American National Corpus, presumably modeled
   on the BNC.  See http://www.cs.vassar.edu/~ide/anc/.

3. Curt Burgess at UC Riverside has reportedly been building corpora
   from Usenet postings.  I couldn't find a Web page on this project,
   but his home page is http://locutus.ucr.edu/~curt/.

4. The CELEX database.  "They used a database of about 17 million words
   as opposed to the 1 million from FK. However, that is a British
   English count. It would be nice if there were something available
   on the web that allowed a person to enter a word (or preferably a
   list of words) and to get a count of its frequency per million out
   of a very large corpus. Seems doable, but I don't think it's been
   done."  CELEX is on the Web at http://www.kun.nl/celex/.
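
   For anyone who has a corpus in hand as plain text, the lookup the
   respondent describes (a list of words in, frequency per million
   out) can be sketched roughly as follows.  This is only an
   illustration under stated assumptions: the tokenizer is a crude
   regular expression, the sample text is a toy stand-in for a real
   corpus, and the function names are my own, not part of CELEX, the
   BNC tools, or any other resource mentioned here.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple alphabetic tokens.

    Assumption: a crude regex tokenizer; real corpora need more care
    (hyphenation, numerals, markup, lemmatisation, and so on).
    """
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def frequency_per_million(words, tokens):
    """Return {word: occurrences per million tokens} for each requested word."""
    total = len(tokens)
    if total == 0:
        raise ValueError("corpus is empty")
    counts = Counter(tokens)
    # Counter returns 0 for unseen words, so absent words map to 0.0.
    return {w: counts[w.lower()] * 1_000_000 / total for w in words}


# Toy stand-in for a large corpus (hypothetical data, for illustration only).
sample_text = "The cat sat on the mat. The dog sat too."
tokens = tokenize(sample_text)
result = frequency_per_million(["the", "sat", "zebra"], tokens)
print(result)
```

   With a sample this small the per-million figures are meaningless;
   the point is only the normalisation step (count times one million,
   divided by corpus size), which is what makes counts from corpora of
   different sizes, such as the 1 million words of F&K and the 17
   million behind CELEX, comparable.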

Thanks again to those who replied!

  Philip