Corpus of American English

Mark Davies Mark_Davies at BYU.EDU
Fri Jul 11 13:44:23 UTC 2008


> Are the numbers for BNC about the same?

The BNC has:

PoS     Types*    Lemmas    Types per lemma
----    ------    ------    ---------------
N       257620    375571    0.69
V        57072     39055    1.46
AJ      127608    126463    1.01
AV        9316      9278    1.00
----
TOTAL: 550,367 lemmas

(*Types = unique word forms, so realize, realizes, realizing, etc. are different types)
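
To make the type/lemma distinction concrete, here is a minimal sketch in Python. The token list and the TOY_LEMMAS lookup are invented for illustration; they are not CLAWS output or BNC data. The sketch assumes types are counted per PoS, as in the table.

```python
# Toy illustration of types vs. lemmas. The tokens and the lookup table
# below are made up for this example (not CLAWS output, not BNC data).
TOY_LEMMAS = {
    ("realize", "V"): "realize",
    ("realizes", "V"): "realize",
    ("realizing", "V"): "realize",
    ("feeling", "V"): "feel",
    ("feeling", "N"): "feeling",
}

# Each token is a (word form, PoS) pair, as a tagger might emit it.
tokens = [
    ("realize", "V"), ("realizes", "V"), ("realizing", "V"),
    ("realizes", "V"), ("feeling", "V"), ("feeling", "N"),
]

# A type is a unique word form within a PoS; a lemma is a (headword, PoS) pair.
types = set(tokens)
lemmas = {(TOY_LEMMAS[tok], tok[1]) for tok in tokens}

print(len(types), "types;", len(lemmas), "lemmas")
```

Here 'feeling' counts twice in both sets, once as N and once as V, since lemmas are defined in part by PoS.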

Notes --

1. Notice that nouns have more lemmas than types, which of course can't really be the case. Again, this is a function of incorrect lemmatization by CLAWS for very low-frequency items.
2. So the total is probably more like ~450,000 lemmas, rather than ~550,000.
3. This ~450,000 or so is less than the ~800,000 from the Corpus of American English, but that's in large part a function of corpus size (BNC = 100 million words, CAE = 360 million words). Not the ideal scenario -- we'd like the number to be a function of what's really going on in the dialect, not just of the corpus.
4. But this also shows why very small corpora (like the 22 million word American National Corpus) would *really* have problems -- especially when the corpus is so unbalanced in terms of text types: only 500,000 words of fiction (compared with 72 million in the CAE), and 1/7 of the ANC from a blog dealing with "Buffy the Vampire Slayer" (see the "Comparison with the ANC" link at www.americancorpus.org).
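
For anyone who wants to check the arithmetic, here is a short Python sketch that recomputes the ratios and the total from the table above and flags any PoS where lemmas outnumber types. The variable names are just for illustration, and the "suspect" test is only a sanity check, not a correction of the lemmatization.

```python
# Figures copied from the BNC table above: PoS -> (types, lemmas).
bnc = {
    "N": (257620, 375571),
    "V": (57072, 39055),
    "AJ": (127608, 126463),
    "AV": (9316, 9278),
}

# Types per lemma, rounded to two places as in the table.
ratios = {pos: round(t / l, 2) for pos, (t, l) in bnc.items()}

# Sum of the lemma column.
total_lemmas = sum(l for _, l in bnc.values())

# A PoS with more lemmas than types cannot be right, so flag it.
suspect = [pos for pos, (t, l) in bnc.items() if l > t]

print(ratios)
print(f"TOTAL: {total_lemmas:,} lemmas")
print("More lemmas than types:", suspect)
```

Only the noun row is flagged, which is the problem described in note 1.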

Best,

Mark D.

============================================
Mark Davies
Professor of (Corpus) Linguistics
Brigham Young University
(phone) 801-422-9168 / (fax) 801-422-0906
Web: davies-linguistics.byu.edu

** Corpus design and use // Linguistic databases **
** Historical linguistics // Language variation **
** English, Spanish, and Portuguese **
============================================

> -----Original Message-----
> From: American Dialect Society [mailto:ADS-L at LISTSERV.UGA.EDU] On Behalf Of LanDi Liu
> Sent: Thursday, July 10, 2008 8:59 PM
> To: ADS-L at LISTSERV.UGA.EDU
> Subject: Re: Corpus of American English
>
> Are the numbers for BNC about the same?
>
> On Fri, Jul 11, 2008 at 2:56 AM, Mark Davies <Mark_Davies at byu.edu> wrote:
> >
> >> Does this method count some (?many) "words" more than once, such as
> >> "feeling" as both a noun and a verb?
> >
> > Yes, lemmas are usually defined in part by PoS, so 'feeling' as a N and a V would be two separate lemmas.
> >
> > Mark D.
> >
> >> -----Original Message-----
> >> From: American Dialect Society [mailto:ADS-L at LISTSERV.UGA.EDU] On Behalf Of Joel S. Berson
> >> Sent: Thursday, July 10, 2008 10:42 AM
> >> To: ADS-L at LISTSERV.UGA.EDU
> >> Subject: Re: Corpus of American English
> >>
> >> Does this method count some (?many) "words" more than once, such as
> >> "feeling" as both a noun and a verb?
> >>
> >> Joel
> >>
> >> At 7/9/2008 11:45 PM, Mark Davies wrote:
> >> >For what it's worth, here's some data from the 360 million word
> >> >Corpus of American English (www.americancorpus.org) --
> >> >
> >> >NOUN    678,887 distinct word forms. At roughly 1.95 word forms per
> >> >lemma, = 347,399 nouns (doesn't include proper nouns)
> >> >ADJ     422,718 distinct word forms. At roughly 1.16 word forms per
> >> >lemma, = 362,848 adjectives
> >> >ADV     30,093 distinct word forms. At roughly 1.00 word forms per
> >> >lemma, = 30,093 adverbs
> >> >VERB    135,795 distinct word forms. At roughly 2.05 word forms per
> >> >lemma, = 66,560 verbs
> >> >         (Note: 2.05 does seem low, but low-frequency verbs (e.g.
> >> > those occurring only 3-4 times) won't have the full 3-4 forms (e.g.
> >> > feel, feels, feeling, felt).)
> >> >
> >> >TOTAL: 806,900 lemmas
> >> >(where lemma is a function of PoS, e.g. 'strike' as N and V are two
> >> >separate lemmas)
> >> >
> >> >There are some problems with these numbers, since they are a
> >> >function of the tagging and lemmatization done by CLAWS. Like any
> >> >tagger or lemmatizer, it has problems with unknown words (i.e. not
> >> >in its lexicon; mainly low-frequency items). In these cases, it's
> >> >essentially just guessing the lemma and PoS based on syntactic
> >> >position and morphology. That's why I don't just count up the number
> >> >of different lemmas from CLAWS, and why I calculate the number of
> >> >lemmas, as shown above.
> >> >
> >> >Anyway, not that the CAE contains every word in PDE, but the
> >> >argument could be made that if it's not in 360 million words
> >> >(equally divided between the five main genres), one is quite
> >> >unlikely to encounter it (much) in other texts.
> >> >
> >> >Best,
> >> >
> >> >Mark Davies
> >> >
> >> >------------------------------------------------------------
> >> >The American Dialect Society - http://www.americandialect.org
> >>
>
>
>
> --
> Randy Alexander
> Jilin City, China
> My Manchu studies blog:
> http://www.bjshengr.com/manchu
>

------------------------------------------------------------
The American Dialect Society - http://www.americandialect.org


