[Corpora-List] BNC n-grams

Mark Davies Mark_Davies at byu.edu
Tue Nov 10 05:41:15 UTC 2009


Is anyone aware of a source for n-grams (2-grams and 3-grams) from the BNC? I'm aware of Phrases in English (pie.usna.edu), but I'm referring to the full set of n-grams, e.g. a downloadable file with all 15,000,000+ 2-grams in the BNC. I can generate and distribute these n-grams from my BYU-BNC (http://corpus.byu.edu/bnc), but I first wanted to see whether they're already available somewhere else. I've googled this, but haven't found anything.

I guess the more basic question is whether this data would be useful. We already have, of course, the Google n-grams data, based on a "corpus" tens of thousands of times as large as the BNC. As I see it, though, n-gram data from a structured 100-500 million word corpus might have the following advantages over the Google data:

-- at 10-15 million rows for 2-grams (and perhaps 30-40 million for 3-grams?), it would be small enough to actually load on most machines
-- it could include separate frequency figures for different genres (e.g. spoken, fiction, newspaper, academic)
-- since the BNC is tagged (and in my version, lemmatized as well), it would have an advantage over the untagged and unlemmatized Google data
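
To make the idea concrete, here is a rough sketch of how such rows might be built: each n-gram keeps its word forms, lemmas, and part-of-speech tags, plus a separate count per genre. The function name, the (word, lemma, tag) token layout, and the toy sentences are assumptions made for illustration only; this is not how the BYU-BNC data is actually stored or generated.

```python
from collections import Counter, defaultdict

def genre_ngram_counts(docs, n=2):
    """Count n-grams separately for each genre.

    docs: iterable of (genre, tokens) pairs, where tokens is a list of
    (word, lemma, pos) tuples. This layout is a hypothetical one.
    Returns a dict mapping each n-gram (a tuple of tokens) to a Counter
    of frequencies keyed by genre.
    """
    counts = defaultdict(Counter)
    for genre, tokens in docs:
        # Slide a window of length n over the token sequence.
        for i in range(len(tokens) - n + 1):
            ngram = tuple(tokens[i:i + n])
            counts[ngram][genre] += 1
    return counts

# Toy input, invented for illustration (tags follow the CLAWS5 style).
docs = [
    ("spoken", [("the", "the", "AT0"), ("dogs", "dog", "NN2"), ("bark", "bark", "VVB")]),
    ("fiction", [("the", "the", "AT0"), ("dogs", "dog", "NN2"), ("ran", "run", "VVD")]),
]

counts = genre_ngram_counts(docs, n=2)
for ngram, by_genre in counts.items():
    words = " ".join(word for word, _, _ in ngram)
    print(words, dict(by_genre))
```

With one row per n-gram and one frequency column per genre, a table like this could be searched by word form, lemma, or tag, which the untagged Google data does not allow.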

Comments?

============================================
Mark Davies
Professor of (Corpus) Linguistics
Brigham Young University
(phone) 801-422-9168 / (fax) 801-422-0906

http://davies-linguistics.byu.edu

** Corpus design and use // Linguistic databases **
** Historical linguistics // Language variation **
** English, Spanish, and Portuguese **
============================================ 



_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora


