[Corpora-List] COCA (and COHA) n-grams data

Mark Davies Mark_Davies at byu.edu
Mon Nov 21 20:13:10 UTC 2011


We are pleased to announce that n-grams data from the COCA and COHA corpora is now available for download (http://www.ngrams.info) -- much of it for free.

The free n-grams from COCA (http://corpus.byu.edu/coca; 425 million words, 1990-2011) contain the one million most frequent 2-, 3-, 4-, and 5-grams. The free n-grams from COHA (http://corpus.byu.edu/coha; 400 million words, 1810-2009) contain the frequency of every word and of every 2-, 3-, 4-, and 5-gram that occurs at least three times in the corpus, along with its frequency in each of the 20 decades (1810s-2000s). Other versions of the n-grams include *all* 2-, 3-, and 4-grams from COCA (e.g. 155 million 3-grams). These n-grams are in addition to the other COCA-based word frequency and collocates data available from http://www.wordfrequency.info.
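For readers unfamiliar with this kind of table, here is a minimal sketch of how n-gram counts with a minimum-frequency cutoff and a per-decade breakdown could be computed. It is only an illustration and not the procedure used to build the COCA or COHA files: the function names, the toy input, and the choice to treat each decade as a single token stream are all assumptions, and the sketch ignores part-of-speech tags and text boundaries.

```python
from collections import Counter


def count_ngrams(tokens, n):
    """Count every contiguous n-token sequence in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def ngrams_by_decade(tokens_by_decade, n, min_total=3):
    """Return {ngram: (total, [count per decade])} for n-grams whose total
    frequency across all decades is at least min_total.

    tokens_by_decade is a hypothetical input: {decade: list of tokens}.
    """
    per_decade = {d: count_ngrams(toks, n) for d, toks in tokens_by_decade.items()}

    # Sum the per-decade counts to get the corpus-wide frequency.
    totals = Counter()
    for counts in per_decade.values():
        totals.update(counts)

    decades = sorted(per_decade)
    return {
        gram: (total, [per_decade[d][gram] for d in decades])
        for gram, total in totals.items()
        if total >= min_total  # keep only n-grams at or above the cutoff
    }


# Toy data, invented for illustration only.
toy = {
    1810: "the old house and the old barn".split(),
    1820: "the old house stood".split(),
}
table = ngrams_by_decade(toy, n=2)
```

In this toy input only the bigram ("the", "old") reaches the cutoff of three occurrences, so it is the only entry kept, with one count per decade.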

The COCA and COHA n-grams have two advantages over the Google n-grams (both the contemporary and historical datasets): they are tagged for part of speech (and for lemma, in some of the COCA datasets), and they are based on genre-balanced corpora. In addition, they are easier to install and use on a wide variety of platforms, since they are smaller than the billions of rows of data in the Google datasets (but still large enough, we hope, to be quite useful).

Anyway, for those who might be interested -- http://www.ngrams.info.

Mark Davies
Brigham Young University

============================================
Mark Davies
Professor of (Corpus) Linguistics
Brigham Young University
(phone) 801-422-9168 / (fax) 801-422-0906

http://davies-linguistics.byu.edu

** Corpus design and use // Linguistic databases **
** Historical linguistics // Language variation **
** English, Spanish, and Portuguese **
============================================ 