Corpus of American English

Mark Davies Mark_Davies at BYU.EDU
Thu Jul 10 18:56:42 UTC 2008


> Does this method count some (?many) "words" more than once, such as
> "feeling" as both a noun and a verb?

Yes, lemmas are usually defined in part by PoS, so 'feeling' as a N and a V would be two separate lemmas.
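
For anyone who wants to see the counting convention spelled out, here is a minimal sketch. The token list is made up for illustration, the names (tokens, estimate_lemmas, quoted_estimates) are hypothetical, and none of this reflects how CLAWS works internally. The per-PoS figures are the ones quoted from the earlier message further down; because the forms-per-lemma ratios there are rounded, dividing forms by ratio only approximates those estimates.

```python
from collections import Counter

# Hypothetical tagged tokens: (word form, lemma, PoS). Not real corpus data.
tokens = [
    ("feeling", "feeling", "NOUN"),
    ("feelings", "feeling", "NOUN"),
    ("feeling", "feel", "VERB"),
    ("felt", "feel", "VERB"),
    ("feels", "feel", "VERB"),
    ("strike", "strike", "NOUN"),
    ("strike", "strike", "VERB"),
]

# A lemma is keyed by (lemma, PoS), so one spelling can count more than once.
lemmas = {(lemma, pos) for _, lemma, pos in tokens}
lemmas_per_pos = Counter(pos for _, pos in lemmas)


def estimate_lemmas(distinct_forms: int, forms_per_lemma: float) -> int:
    """Estimate the lemma count from distinct word forms and a rough ratio."""
    return round(distinct_forms / forms_per_lemma)


# Lemma estimates per PoS as quoted in the earlier message.
quoted_estimates = {
    "NOUN": 347_399,
    "ADJ": 362_848,
    "ADV": 30_093,
    "VERB": 66_560,
}

print(len(lemmas), dict(lemmas_per_pos))
print(sum(quoted_estimates.values()))
print(estimate_lemmas(30_093, 1.00))
```

In the toy data, 'strike' contributes two lemmas (one N, one V), which is the same double counting asked about above.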

Mark D.

============================================
Mark Davies
Professor of (Corpus) Linguistics
Brigham Young University
(phone) 801-422-9168 / (fax) 801-422-0906
Web: davies-linguistics.byu.edu

** Corpus design and use // Linguistic databases **
** Historical linguistics // Language variation **
** English, Spanish, and Portuguese **
============================================

> -----Original Message-----
> From: American Dialect Society [mailto:ADS-L at LISTSERV.UGA.EDU] On Behalf Of Joel S.
> Berson
> Sent: Thursday, July 10, 2008 10:42 AM
> To: ADS-L at LISTSERV.UGA.EDU
> Subject: Re: Corpus of American English
>
> ---------------------- Information from the mail header -----------------------
> Sender:       American Dialect Society <ADS-L at LISTSERV.UGA.EDU>
> Poster:       "Joel S. Berson" <Berson at ATT.NET>
> Subject:      Re: Corpus of American English
> -------------------------------------------------------------------------------
>
> Does this method count some (?many) "words" more than once, such as
> "feeling" as both a noun and a verb?
>
> Joel
>
> At 7/9/2008 11:45 PM, Mark Davies wrote:
> >For what it's worth, here's some data from the 360 million word
> >Corpus of American English (www.americancorpus.org) --
> >
> >PoS     Distinct word forms   Forms per lemma (approx.)   Estimated lemmas
> >NOUN    678,887               1.95                        347,399 nouns (doesn't include proper nouns)
> >ADJ     422,718               1.16                        362,848 adjectives
> >ADV     30,093                1.00                        30,093 adverbs
> >VERB    135,795               2.05                        66,560 verbs
> >         (Note: 2.05 does seem low, but low-frequency verbs (e.g.
> > those occurring only 3-4 times) won't have the full 3-4 forms (e.g.
> > feel, feels, feeling, felt).)
> >
> >TOTAL   806,900 lemmas
> >(where lemma is a function of PoS, e.g. 'strike' as N and V are two
> >separate lemmas)
> >
> >There are some problems with these numbers, since they are a
> >function of the tagging and lemmatization done by CLAWS. Like any
> >tagger or lemmatizer, it has problems with unknown words (i.e. not
> >in its lexicon; mainly low-frequency items). In these cases, it's
> >essentially just guessing the lemma and PoS based on syntactic
> >position and morphology. That's why I don't just count up the number
> >of different lemmas that CLAWS assigns, and why I instead estimate
> >the number of lemmas from the word-form counts, as shown above.
> >
> >Anyway, not that the CAE contains every word in PDE, but the
> >argument could be made that if it's not in 360 million words
> >(equally divided between the five main genres), one is quite
> >unlikely to encounter it (much) in other texts.
> >
> >Best,
> >
> >Mark Davies
> >
> >------------------------------------------------------------
> >The American Dialect Society - http://www.americandialect.org