[Corpora-List] New COCA-based frequency lists (500,000 words)

Mark Davies Mark_Davies at byu.edu
Thu Jan 13 15:57:05 UTC 2011


For those who might be interested...

A new, free frequency list based on the Corpus of Contemporary American English [COCA] (http://www.americancorpus.org) is now available. It contains the nearly 500,000 word forms (+PoS tag) that occur at least four times in the 410-million-word corpus. For each word form, it also gives the part of speech (COCA is tagged with CLAWS), the frequency, and the number of texts (out of the 169,000 texts in COCA) in which the word occurs.

The list might be useful for comparing British and American English (cf. the BNC wordlists, e.g. http://ucrel.lancs.ac.uk/bncfreq/flists.html or http://www.kilgarriff.co.uk/bnc-readme.html), or for obtaining lexical data from a corpus that is somewhat larger and more recent than the BNC (410 vs. 100 million words, and current through mid-2010 vs. 1993 for the BNC).
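
Because the two corpora differ in size (410 vs. 100 million words), raw counts from the COCA and BNC lists are not directly comparable; one common approach is to convert each count to a rate per million words first. The following is only a minimal sketch of that arithmetic. The function name and the two raw counts are hypothetical and invented for illustration; they are not taken from either list, and the actual file layout of the downloads is not assumed here.

```python
# Corpus sizes as stated in the announcement (approximate word totals).
COCA_SIZE = 410_000_000
BNC_SIZE = 100_000_000


def per_million(count, corpus_size):
    """Convert a raw frequency into occurrences per million words."""
    return count * 1_000_000 / corpus_size


# Hypothetical raw counts for one word form, invented for illustration only.
coca_count = 8200
bnc_count = 1500

coca_rate = per_million(coca_count, COCA_SIZE)
bnc_rate = per_million(bnc_count, BNC_SIZE)

print(f"COCA: {coca_rate:.1f} per million, BNC: {bnc_rate:.1f} per million")
```

With these invented counts, the raw COCA figure is more than five times the BNC figure, but the normalized rates (20.0 vs. 15.0 per million) differ far less.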

The list can be downloaded from http://www.wordfrequency.info.

Best,

Mark Davies

----------------------------

P.S. As some of you may be aware, there are also a number of other COCA-based frequency materials at http://www.wordfrequency.info:

* Lemmatized lists -- 5,000 - 60,000 lemmas -- carefully reviewed for accuracy.
* Genre-based frequency: The frequency of each of the 60,000 words in the five main genres (spoken, fiction, popular magazines, newspapers, and academic journals), as well as in more than 40 sub-genres (e.g. Fiction-Movie scripts, Magazine-Sports, Newspaper-Financial, or Academic-Medicine).
* Collocates: As many as 200-300 collocates for the top 60,000 lemmas (total of 4.8 million node/collocate pairs).
* N-grams: The frequency of all 155 million 3-grams (three-word strings) in the corpus. 

============================================
Mark Davies
Professor of (Corpus) Linguistics
Brigham Young University
(phone) 801-422-9168 / (fax) 801-422-0906
Web: http://davies-linguistics.byu.edu
 
** Corpus design and use // Linguistic databases **
** Historical linguistics // Language variation **
** English, Spanish, and Portuguese **
============================================