Corpus of American English

Joel S. Berson Berson at ATT.NET
Thu Jul 10 16:42:25 UTC 2008

Does this method count some (?many) "words" more than once, such as
"feeling" as both a noun and a verb?


At 7/9/2008 11:45 PM, Mark Davies wrote:
>For what it's worth, here's some data from the 360 million word
>Corpus of American English ( --
>NOUN    678,887 distinct word forms. At roughly 1.95 word forms per
>lemma, = 347,399 nouns (doesn't include proper nouns)
>ADJ     422,718 distinct word forms. At roughly 1.16 word forms per
>lemma, = 362,848 adjectives
>ADV     30,093 distinct word forms. At roughly 1.00 word forms per
>lemma, = 30,093 adverbs
>VERB    135,795 distinct word forms. At roughly 2.05 word forms per
>lemma, = 66,560 verbs
>         (Note: 2.05 does seem low, but low-frequency verbs (e.g.
> those that occur only 3-4 times) won't have the full 3-4 forms
> (e.g. feel, feels, feeling, felt).)
>TOTAL   806,900 lemmas
>(where lemma is a function of PoS, e.g. 'strike' as N and V are two
>separate lemmas)
>There are some problems with these numbers, since they are a
>function of the tagging and lemmatization done by CLAWS. Like any
>tagger or lemmatizer, it has problems with unknown words (i.e. not
>in its lexicon; mainly low-frequency items). In these cases, it's
>essentially just guessing the lemma and PoS based on syntactic
>position and morphology. That's why I don't just count up the number
>of different lemmas from CLAWS, and why I calculate the number of
>lemmas, as shown above.
>Anyway, not that the CAE contains every word in PDE, but the
>argument could be made that if it's not in 360 million words
>(equally divided between the five main genres), one is quite
>unlikely to encounter it (much) in other texts.
>Mark Davies
>Professor of (Corpus) Linguistics
>Brigham Young University
>(phone) 801-422-9168 / (fax) 801-422-0906
>** Corpus design and use // Linguistic databases **
>** Historical linguistics // Language variation **
>** English, Spanish, and Portuguese **
>The American Dialect Society -
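
As a rough check on the arithmetic in the quoted message, here is a minimal Python sketch, not Mark Davies's actual procedure. It divides each quoted count of distinct word forms by the quoted forms-per-lemma ratio and compares the result with the lemma figure given in the message. The ratios are stated as "roughly" 1.95, 1.16 and so on, so the recomputed values are only expected to land near the quoted ones. The names `QUOTED` and `estimate_lemmas` and the small `tagged` sample are hypothetical, introduced only for illustration. The last few lines show, under the assumption that forms are counted separately for each part of speech, how a string such as "feeling" would be counted once as a noun and once as a verb.

```python
# Figures as quoted in the message above: (distinct word forms,
# approximate forms per lemma, lemma count given in the message).
QUOTED = {
    "NOUN": (678_887, 1.95, 347_399),
    "ADJ": (422_718, 1.16, 362_848),
    "ADV": (30_093, 1.00, 30_093),
    "VERB": (135_795, 2.05, 66_560),
}


def estimate_lemmas(word_forms, forms_per_lemma):
    """Estimate lemmas as distinct word forms divided by forms per lemma."""
    return round(word_forms / forms_per_lemma)


estimates = {pos: estimate_lemmas(f, r) for pos, (f, r, _) in QUOTED.items()}
quoted_total = sum(q for _, _, q in QUOTED.values())

for pos, (forms, ratio, quoted) in QUOTED.items():
    # The two columns differ slightly because the quoted ratios are rounded.
    print(f"{pos:5} estimate={estimates[pos]:>8,} quoted={quoted:>8,}")
print(f"TOTAL quoted={quoted_total:,}")

# Hypothetical tagged tokens, assuming forms are counted per part of speech.
tagged = [("feeling", "NOUN"), ("feeling", "VERB"), ("feels", "VERB")]
per_pos_forms = set(tagged)                   # "feeling" appears twice here
surface_forms = {word for word, _ in tagged}  # "feeling" appears once here
print(len(per_pos_forms), "forms counted per PoS;",
      len(surface_forms), "distinct strings")
```

The lemma figures quoted for the four categories do sum to the stated total of 806,900. Whether a given string is counted under more than one part of speech depends on how CLAWS tagged its occurrences, which this sketch does not model.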
