Corpus of American English

LanDi Liu strangeguitars at GMAIL.COM
Fri Jul 11 02:58:53 UTC 2008


Are the numbers for the BNC about the same?

On Fri, Jul 11, 2008 at 2:56 AM, Mark Davies <Mark_Davies at byu.edu> wrote:
> ---------------------- Information from the mail header -----------------------
> Sender:       American Dialect Society <ADS-L at LISTSERV.UGA.EDU>
> Poster:       Mark Davies <Mark_Davies at BYU.EDU>
> Subject:      Re: Corpus of American English
> -------------------------------------------------------------------------------
>
>> Does this method count some (?many) "words" more than once, such as
>> "feeling" as both a noun and a verb?
>
> Yes, lemmas are usually defined in part by PoS, so 'feeling' as a N and a V would be two separate lemmas.
>
> Mark D.
>
> ============================================
> Mark Davies
> Professor of (Corpus) Linguistics
> Brigham Young University
> (phone) 801-422-9168 / (fax) 801-422-0906
> Web: davies-linguistics.byu.edu
>
> ** Corpus design and use // Linguistic databases **
> ** Historical linguistics // Language variation **
> ** English, Spanish, and Portuguese **
> ============================================
>
>> -----Original Message-----
>> From: American Dialect Society [mailto:ADS-L at LISTSERV.UGA.EDU] On Behalf Of Joel S.
>> Berson
>> Sent: Thursday, July 10, 2008 10:42 AM
>> To: ADS-L at LISTSERV.UGA.EDU
>> Subject: Re: Corpus of American English
>>
>> Does this method count some (?many) "words" more than once, such as
>> "feeling" as both a noun and a verb?
>>
>> Joel
>>
>> At 7/9/2008 11:45 PM, Mark Davies wrote:
>> >For what it's worth, here's some data from the 360 million word
>> >Corpus of American English (www.americancorpus.org) --
>> >
>> >NOUN    678,887 distinct word forms. At roughly 1.95 word forms per
>> >lemma, = 347,399 nouns (doesn't include proper nouns)
>> >ADJ     422,718 distinct word forms. At roughly 1.16 word forms per
>> >lemma, = 362,848 adjectives
>> >ADV     30,093 distinct word forms. At roughly 1.00 word forms per
>> >lemma, = 30,093 adverbs
>> >VERB    135,795 distinct word forms. At roughly 2.05 word forms per
>> >lemma, = 66,560 verbs
>> >         (Note: 2.05 does seem low, but low-frequency verbs (e.g.
>> > those occurring only 3-4 times) won't have the full 3-4 forms
>> > (e.g. feel, feels, feeling, felt).)
>> >
>> >**TOTAL** 806,900 lemmas
>> >(where lemma is a function of PoS, e.g. 'strike' as N and V are two
>> >separate lemmas)
>> >
>> >There are some problems with these numbers, since they are a
>> >function of the tagging and lemmatization done by CLAWS. Like any
>> >tagger or lemmatizer, it has problems with unknown words (i.e. not
>> >in its lexicon; mainly low-frequency items). In these cases, it's
>> >essentially just guessing the lemma and PoS based on syntactic
>> >position and morphology. That's why I don't simply count the
>> >distinct lemmas that CLAWS outputs, and instead estimate the number
>> >of lemmas from the word-form counts, as shown above.
>> >
>> >Anyway, not that the CAE contains every word in PDE, but the
>> >argument could be made that if it's not in 360 million words
>> >(equally divided between the five main genres), one is quite
>> >unlikely to encounter it (much) in other texts.
>> >
>> >Best,
>> >
>> >Mark Davies
>> >
>> >------------------------------------------------------------
>> >The American Dialect Society - http://www.americandialect.org
>
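The per-PoS arithmetic quoted above can be checked with a short script. This is only a rough sketch: the counts are copied from Mark's post, while the names QUOTED, forms_per_lemma and total_lemmas are made up here for illustration and are not part of any corpus tool. It works backwards from each pair of counts to the implied forms-per-lemma ratio, then sums the lemma estimates.

```python
# Figures quoted in the thread, per part of speech:
# (distinct word forms, estimated lemmas).
# The dictionary and function names are hypothetical, chosen for this sketch.
QUOTED = {
    "NOUN": (678_887, 347_399),
    "ADJ": (422_718, 362_848),
    "ADV": (30_093, 30_093),
    "VERB": (135_795, 66_560),
}


def forms_per_lemma(word_forms: int, lemmas: int) -> float:
    """Ratio implied by a (word forms, lemmas) pair."""
    return word_forms / lemmas


def total_lemmas(table: dict) -> int:
    """Sum of the per-PoS lemma estimates."""
    return sum(lemmas for _forms, lemmas in table.values())


if __name__ == "__main__":
    for pos, (forms, lemmas) in QUOTED.items():
        ratio = forms_per_lemma(forms, lemmas)
        print(f"{pos:<5} {forms:>8,} forms / {lemmas:>8,} lemmas = {ratio:.3f}")
    print(f"TOTAL {total_lemmas(QUOTED):,} lemmas")
```

The implied ratios should land close to the "roughly" figures in the post (1.95, 1.16, 1.00, 2.05), with small differences from rounding, and the sum should agree with the quoted total of 806,900.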



--
Randy Alexander
Jilin City, China
My Manchu studies blog:
http://www.bjshengr.com/manchu



