[Corpora-List] BNC n-grams

Serge Sharoff s.sharoff at leeds.ac.uk
Tue Nov 10 09:04:33 UTC 2009


I did this quite some time ago, but I never thought of it as an
achievement, since the list is trivial to produce.  In case you need
the bigrams, they are at
http://corpus.leeds.ac.uk/frqc/bnc-bi.gz
(the list is based on lemmas, but I didn't use POS tags).
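
To illustrate why such a list is easy to produce, here is a minimal
sketch of lemma bigram counting.  It is not the script behind
bnc-bi.gz.  The function name count_bigrams and the toy input are
hypothetical, and the sketch assumes the corpus has already been split
into sentences and lemmatized, which the post does not describe.

```python
from collections import Counter


def count_bigrams(sentences):
    """Count adjacent lemma pairs; pairs never cross a sentence boundary."""
    counts = Counter()
    for lemmas in sentences:
        # zip pairs each lemma with its right-hand neighbour
        counts.update(zip(lemmas, lemmas[1:]))
    return counts


# Hypothetical toy input: each sentence is a list of lemmas.
sample = [
    ["the", "cat", "sit", "on", "the", "mat"],
    ["the", "cat", "sleep"],
]

bigrams = count_bigrams(sample)
for (w1, w2), freq in bigrams.most_common():
    print(freq, w1, w2)
```

For a corpus the size of the BNC the same loop could read sentences
from disk one at a time, so only the counts need to fit in memory.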

Another advantage of the BNC over the Google data is the absence of noise
coming from navigation frames (Have your say, Click here) as well as from
duplicate pages (Stefan Evert published some examples of this, but nothing
comes to mind off the top of my head).  The disadvantages of the BNC are
obviously the time frame (the Soviet Union is still quite prominent there)
and its restriction to British English.
Serge


On Tue, 2009-11-10 at 05:41 +0000, Mark Davies wrote:
> Is anyone aware of a source for n-grams (2-grams and 3-grams) from the
> BNC? I'm aware of Phrases in English (pie.usna.edu), but I'm referring
> to the full set of n-grams, e.g. a downloadable file with all
> 15,000,000+ 2-grams in the BNC. I can generate and distribute these
> n-grams from my BYU-BNC (http://corpus.byu.edu/bnc), but I first
> wanted to see whether they're already available somewhere else. I've
> googled this, but haven't found anything.
> 
> I guess the more basic question is whether this data would be useful.
> We already have, of course, the Google ngrams data, based on a
> "corpus" tens of thousands of times as large as the BNC. As I see it,
> though, the ngrams data from a structured 100-500 million word corpus
> might have the following advantages over the Google data:
> 
> -- at 10-15 million rows (for 2-grams; 30-40m 3-grams (??) ), small
> enough to actually load on most machines
> -- it could include separate frequency figures for different genres
> (e.g. spoken, fiction, newspaper, academic)
> -- since the BNC is tagged (and in my version, lemmatized as well), it
> would have an advantage over the untagged and unlemmatized Google data
> 
> Comments?
> 
> ============================================
> Mark Davies
> Professor of (Corpus) Linguistics
> Brigham Young University
> (phone) 801-422-9168 / (fax) 801-422-0906
> 
> http://davies-linguistics.byu.edu
> 
> ** Corpus design and use // Linguistic databases **
> ** Historical linguistics // Language variation **
> ** English, Spanish, and Portuguese **
> ============================================ 
> 
> 
> 
> _______________________________________________
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
