[Corpora-List] BNC n-grams
Mark Davies
Mark_Davies at byu.edu
Tue Nov 10 05:41:15 UTC 2009
Is anyone aware of a source for n-grams (2-grams and 3-grams) from the BNC? I'm aware of Phrases in English (pie.usna.edu), but I'm referring to the full set of n-grams, e.g. a downloadable file with all 15,000,000+ 2-grams in the BNC. I can generate and distribute these n-grams from my BYU-BNC (http://corpus.byu.edu/bnc), but I first wanted to see whether they're already available somewhere else. I've googled this, but haven't found anything.
I guess the more basic question is whether this data would be useful. We already have, of course, the Google n-grams data, based on a "corpus" tens of thousands of times as large as the BNC. As I see it, though, n-grams data from a structured 100-500 million word corpus might have the following advantages over the Google data:
-- at 10-15 million rows for 2-grams (and perhaps 30-40 million for 3-grams?), it would be small enough to actually load on most machines
-- it could include separate frequency figures for different genres (e.g. spoken, fiction, newspaper, academic)
-- since the BNC is tagged (and in my version, lemmatized as well), it would have an advantage over the untagged and unlemmatized Google data
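To make the genre point concrete, here is a minimal Python sketch of what a table of n-grams with separate per-genre frequencies might look like. It is only an illustration: the function names (ngrams, count_by_genre), the genre labels, and the toy sentences are all made up for the example and are not taken from the BNC or from the BYU-BNC database.

```python
from collections import Counter, defaultdict

def ngrams(tokens, n):
    """Yield successive n-token tuples from a list of tokens."""
    return zip(*(tokens[i:] for i in range(n)))

def count_by_genre(texts, n=2):
    """Build {ngram: Counter({genre: frequency})} from (genre, tokens) pairs."""
    table = defaultdict(Counter)
    for genre, tokens in texts:
        for gram in ngrams(tokens, n):
            table[gram][genre] += 1
    return table

# Toy input (hypothetical, not BNC text): one token list per genre.
sample = [
    ("spoken", ["you", "know", "what", "you", "know"]),
    ("academic", ["as", "we", "know", "what", "follows"]),
]

table = count_by_genre(sample, n=2)

# One row per 2-gram, with a frequency column per genre.
for gram, freqs in sorted(table.items()):
    print(" ".join(gram), dict(freqs))
```

With a tagged and lemmatized corpus, the tokens could just as well be (word, lemma, tag) triples, so the same table could be grouped by lemma or part of speech.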
Comments?
============================================
Mark Davies
Professor of (Corpus) Linguistics
Brigham Young University
(phone) 801-422-9168 / (fax) 801-422-0906
http://davies-linguistics.byu.edu
** Corpus design and use // Linguistic databases **
** Historical linguistics // Language variation **
** English, Spanish, and Portuguese **
============================================