[Corpora-List] COCA (and COHA) n-grams data

Toddy Mladenov toddysm at gmail.com
Mon Nov 21 20:28:31 UTC 2011


Mark,

The free list link gives an error. You may want to look at it.

Toddy
Twitter: @toddysm
Blog: http://www.toddysm.com
Sent from my Blackberry

-----Original Message-----
From: Mark Davies <Mark_Davies at byu.edu>
Sender: corpora-bounces at uib.no
Date: Mon, 21 Nov 2011 20:13:10 
To: corpora at uib.no <corpora at uib.no>
Subject: [Corpora-List] COCA (and COHA) n-grams data

We are pleased to announce that n-grams data from the COCA and COHA corpora is now available for download (http://www.ngrams.info) -- much of it for free.

The free n-grams from COCA (http://corpus.byu.edu/coca; 425 million words, 1990-2011) contain the one million most frequent 2, 3, 4, and 5-grams. The free n-grams from COHA (http://corpus.byu.edu/coha; 400 million words, 1810-2009) contain the frequency of every word, and every 2, 3, 4, and 5-gram that occurs at least three times in the corpus, along with its frequency in each of the 20 decades (1810s-2000s). Other versions of the n-grams include *all* 2, 3, and 4-grams from COCA (e.g. 155 million 3-grams). This n-grams data is in addition to the other COCA-based word frequency and collocates data that is available from http://www.wordfrequency.info.

One advantage of the COCA and COHA n-grams over the Google n-grams (both contemporary and historical datasets) is that the COCA / COHA n-grams are tagged for part of speech (as well as lemma, for some of the COCA datasets), and that they are based on genre-balanced corpora. In addition, it is easier to install and use these n-grams on a wide variety of platforms, since the n-grams are smaller than the billions of rows of data in the Google datasets (but still large enough to hopefully be quite useful).

Anyway, for those who might be interested -- http://www.ngrams.info.

Mark Davies
Brigham Young University

============================================
Mark Davies
Professor of (Corpus) Linguistics
Brigham Young University
(phone) 801-422-9168 / (fax) 801-422-0906

http://davies-linguistics.byu.edu

** Corpus design and use // Linguistic databases **
** Historical linguistics // Language variation **
** English, Spanish, and Portuguese **
============================================ 


_______________________________________________
UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora


