[Corpora-List] COCA-based word frequency resources: 4,300,000 collocate pairs, n-grams, new ebook

Mark Davies Mark_Davies at byu.edu
Fri Mar 12 15:33:12 UTC 2010


We have just released the final set of word frequency resources that are based on the 400 million word Corpus of Contemporary American English<http://www.americancorpus.org/> (COCA). More information -- including samples of each type of resource -- is available from http://www.wordfrequency.info.



Now available:



1. Top 20,000 lemmas<http://www.wordfrequency.info/files/entriesWithCollocates.zip>, with the top 200-300 collocates per word -- for a total of more than 4,300,000 word / collocate pairs. For each pair, the list gives:

  -- Frequency of the collocate

  -- Mutual Information score

  -- Pre/post ratio of the collocate with regard to the node word
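
To illustrate the kind of figures listed above, here is a minimal Python sketch. It assumes the textbook pointwise Mutual Information formula and reads the pre/post ratio as "occurrences before the node divided by occurrences after it". The exact formulas, span settings, and column names used in the released files are not given in this message, so the function and parameter names below are hypothetical.

```python
import math


def mutual_information(pair_freq, node_freq, collocate_freq, corpus_size):
    """Pointwise MI in bits: log2(observed / expected co-occurrence).

    Assumption: the standard textbook formula, with no span correction.
    The released data may use a different variant.
    """
    expected = node_freq * collocate_freq / corpus_size
    return math.log2(pair_freq / expected)


def pre_post_ratio(pre_count, post_count):
    """Collocate occurrences before the node word divided by those after it.

    Assumption: this is one plausible reading of "pre/post ratio".
    """
    if post_count == 0:
        return math.inf  # collocate only ever precedes the node word
    return pre_count / post_count


# Made-up counts, for illustration only.
mi = mutual_information(pair_freq=8, node_freq=1000,
                        collocate_freq=1000, corpus_size=1_000_000)
ratio = pre_post_ratio(pre_count=30, post_count=10)
print(f"MI = {mi:.2f}, pre/post = {ratio:.2f}")
```

A higher MI score means the pair co-occurs more often than the individual frequencies would predict; a pre/post ratio above 1 means the collocate tends to come before the node word.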



2. N-grams<http://www.wordfrequency.info/?l=ngrams.asp>. All 155,000,000 trigrams in the corpus -- along with their frequency -- linked to a lexicon with word form (+/- case sensitive), part of speech, and lemma. You will need to use SQL joins to extract the data. Given the structure of the data, the bigrams can be easily generated from the trigram list.
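
As a rough sketch of what the SQL joins and the bigram derivation might look like, the Python example below uses the standard-library sqlite3 module with an in-memory database. The table names (lexicon, trigrams), column names (word_id, w1, w2, w3, freq), and the sample rows are all hypothetical; the actual schema of the released data may differ.

```python
import sqlite3

# Hypothetical schema: trigrams store integer IDs that point into the lexicon.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE lexicon (word_id INTEGER PRIMARY KEY, word TEXT, lemma TEXT, pos TEXT);
CREATE TABLE trigrams (freq INTEGER, w1 INTEGER, w2 INTEGER, w3 INTEGER);
""")

# Made-up sample rows, for illustration only.
conn.executemany("INSERT INTO lexicon VALUES (?, ?, ?, ?)", [
    (1, "the", "the", "at"),
    (2, "big", "big", "jj"),
    (3, "dog", "dog", "nn1"),
    (4, "dogs", "dog", "nn2"),
])
conn.executemany("INSERT INTO trigrams VALUES (?, ?, ?, ?)", [
    (50, 1, 2, 3),
    (20, 1, 2, 4),
])


def trigrams_with_words(conn):
    """Join each trigram slot to the lexicon to recover the word forms."""
    return conn.execute("""
        SELECT t.freq, a.word, b.word, c.word, c.lemma
        FROM trigrams AS t
        JOIN lexicon AS a ON a.word_id = t.w1
        JOIN lexicon AS b ON b.word_id = t.w2
        JOIN lexicon AS c ON c.word_id = t.w3
        ORDER BY t.freq DESC
    """).fetchall()


def bigrams_from_trigrams(conn):
    """Sum trigram frequencies over the third slot to get bigram counts."""
    return conn.execute("""
        SELECT a.word, b.word, SUM(t.freq) AS bigram_freq
        FROM trigrams AS t
        JOIN lexicon AS a ON a.word_id = t.w1
        JOIN lexicon AS b ON b.word_id = t.w2
        GROUP BY t.w1, t.w2
        ORDER BY bigram_freq DESC
    """).fetchall()


for row in trigrams_with_words(conn):
    print(row)
for row in bigrams_from_trigrams(conn):
    print(row)
```

Summing over the third slot counts a bigram once for every trigram it begins, so a bigram that occurs at the very end of a text (with no following word) would be missed. Whether that matters depends on how the trigram list treats text boundaries, which this message does not specify.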



3. New eBook<http://www.wordfrequency.info/files/entries.pdf> version (more for student / learner use)

  -- Top 20,000 words in English in order of frequency

  -- 20-30 collocates (nearby words) and synonyms for each word

  -- Other frequency information, including indication of variation by genre



Formats previously announced:



-- Printed book (Routledge<http://www.routledge.com/books/A-Frequency-Dictionary-of-Contemporary-American-English-isbn9780415490634>, 2010) (top 5,000 entries, collocates, thematic lists)

-- Free<http://www.wordfrequency.info/free> listing of the top 5,000 words (without collocates or synonyms).



============================================

Mark Davies

Professor of (Corpus) Linguistics

Brigham Young University

(phone) 801-422-9168 / (fax) 801-422-0906



http://davies-linguistics.byu.edu



** Corpus design and use // Linguistic databases **

** Historical linguistics // Language variation **

** English, Spanish, and Portuguese **

============================================



