number; statistic; Corpus of American English

Mark Davies Mark_Davies at BYU.EDU
Thu Jul 10 03:45:59 UTC 2008

For what it's worth, here's some data from the 360-million-word Corpus of American English:

NOUN    678,887 distinct word forms. At roughly 1.95 word forms per lemma, = 347,399 nouns (doesn't include proper nouns)
ADJ     422,718 distinct word forms. At roughly 1.16 word forms per lemma, = 362,848 adjectives
ADV     30,093 distinct word forms. At roughly 1.00 word forms per lemma, = 30,093 adverbs
VERB    135,795 distinct word forms. At roughly 2.05 word forms per lemma, = 66,560 verbs
        (Note: 2.05 does seem low, but low-frequency verbs (e.g. those occurring only 3-4 times) won't have the full 3-4 forms (e.g. feel, feels, feeling, felt).)

TOTAL   806,900 lemmas
(where lemma is a function of PoS, e.g. 'strike' as N and V are two separate lemmas)
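
For anyone who wants to check the arithmetic, here is a minimal Python sketch of the estimate (word forms divided by forms per lemma). It is only an illustration, not the procedure actually used to produce the figures; the names POSTED, estimate_lemmas and relative_gap are made up for the example. Because the forms-per-lemma ratios are quoted to two decimals, dividing by them reproduces the posted lemma counts only approximately.

```python
# Figures copied from the post above:
# (distinct word forms, rough word forms per lemma, lemma estimate as posted).
# All names here are illustrative assumptions, not part of any corpus tool.
POSTED = {
    "NOUN": (678_887, 1.95, 347_399),
    "ADJ": (422_718, 1.16, 362_848),
    "ADV": (30_093, 1.00, 30_093),
    "VERB": (135_795, 2.05, 66_560),
}


def estimate_lemmas(word_forms: int, forms_per_lemma: float) -> float:
    """Estimate the lemma count as distinct word forms / word forms per lemma."""
    return word_forms / forms_per_lemma


def relative_gap(estimate: float, posted: int) -> float:
    """Relative difference between a recomputed estimate and the posted figure."""
    return abs(estimate - posted) / posted


# Sum of the posted per-PoS lemma estimates (the TOTAL line).
posted_total = sum(lemmas for _, _, lemmas in POSTED.values())

for pos, (forms, ratio, lemmas) in POSTED.items():
    est = estimate_lemmas(forms, ratio)
    print(f"{pos:5} {forms:>8,} / {ratio:.2f} = {est:>10,.0f}  (posted: {lemmas:,})")

print(f"TOTAL of posted lemma estimates: {posted_total:,}")
```

With the rounded ratios, each recomputed figure lands within one percent of the posted one, and the four posted estimates sum to the 806,900 total.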

There are some problems with these numbers, since they are a function of the tagging and lemmatization done by CLAWS. Like any tagger or lemmatizer, it has problems with unknown words (i.e. those not in its lexicon; mainly low-frequency items). In these cases, it's essentially just guessing the lemma and PoS based on syntactic position and morphology. That's why I don't just count up the number of different lemmas that CLAWS assigns, and why I instead estimate the number of lemmas from the word-form counts, as shown above.

Anyway, not that the CAE contains every word in PDE, but the argument could be made that if it's not in 360 million words (equally divided between the five main genres), one is quite unlikely to encounter it (much) in other texts.


Mark Davies
Professor of (Corpus) Linguistics
Brigham Young University
(phone) 801-422-9168 / (fax) 801-422-0906

** Corpus design and use // Linguistic databases **
** Historical linguistics // Language variation **
** English, Spanish, and Portuguese **
