COCA Corpus of Contemporary American English (UNCLASSIFIED)

Tom Zurinskas truespel at HOTMAIL.COM
Tue Feb 23 06:03:44 UTC 2010


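I played with the 5k COCA database by putting it into a spreadsheet.  The data are below.  For each part-of-speech code, the columns give the total instances, the share of all instances, the number of distinct words, the share of the 5,000 words, and the most frequent word.  If you copy and paste the data into Excel, they may line up better in rows and columns.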

code     instances  % of instances  words  % of words  top word
n       67,636,257           20.5%  2,543       50.9%  year
v       64,351,414           19.5%  1,000       20.0%  be
i       42,641,196           12.9%     97        1.9%  of
a       39,296,373           11.9%     11        0.2%  the
p       24,983,795            7.6%     46        0.9%  I
c       23,197,479            7.0%     38        0.8%  and
r       21,415,134            6.5%    340        6.8%  up
j       20,020,528            6.1%    839       16.8%  other
d       11,601,874            3.5%     34        0.7%  this
t        6,332,195            1.9%      1        0.0%  to
m        3,737,047            1.1%     35        0.7%  one
x        3,138,830            1.0%      2        0.0%  not
e          784,528            0.2%      1        0.0%  there
u          523,135            0.2%     13        0.3%  yes
total  329,659,785          100.0%  5,000      100.0%

The data show that nouns are the most common part of speech by word count: half of the database is nouns (2,543 out of 5,000).  By "instances," though, nouns make up 20.5% of the total, only slightly ahead of verbs at 19.5%, and there are only 1,000 verbs.  Thus the average verb is repeated about 2.4 times as often as the average noun (roughly 64,000 instances per verb versus 27,000 per noun).
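
The repetition-rate comparison can be checked with a few lines of Python.  This is only a sketch: the figures are copied from the n and v rows of the table, and the names (pos_counts, instances_per_word) are made up for illustration.

```python
# Figures copied from the n and v rows of the table:
# (total instances, number of distinct words).
# The variable and function names are illustrative only.
pos_counts = {
    "n": (67_636_257, 2_543),  # nouns
    "v": (64_351_414, 1_000),  # verbs
}


def instances_per_word(instances, words):
    """Average number of instances per distinct word."""
    return instances / words


noun_rate = instances_per_word(*pos_counts["n"])
verb_rate = instances_per_word(*pos_counts["v"])
ratio = verb_rate / noun_rate

print(f"nouns: {noun_rate:,.0f} instances per word")
print(f"verbs: {verb_rate:,.0f} instances per word")
print(f"verbs repeat {ratio:.1f}x as often as nouns")
```

By this arithmetic the average verb turns up a bit more than twice as often as the average noun.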

The count for the 5k most popular words in this database is 329,659,785.  This may be off by a few thousand, because about five of the numbers didn't download and had to be approximated.  I can't figure out why.
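
As a sanity check, the per-code figures can be summed to confirm the totals.  This sketch assumes the numbers are typed in exactly as they appear in the table; the list names are illustrative.

```python
# Instances and word counts per part-of-speech code, in table order
# (n, v, i, a, p, c, r, j, d, t, m, x, e, u).
instances = [
    67_636_257, 64_351_414, 42_641_196, 39_296_373, 24_983_795,
    23_197_479, 21_415_134, 20_020_528, 11_601_874, 6_332_195,
    3_737_047, 3_138_830, 784_528, 523_135,
]
words = [2_543, 1_000, 97, 11, 46, 38, 340, 839, 34, 1, 35, 2, 1, 13]

total_instances = sum(instances)
total_words = sum(words)

print(f"total instances: {total_instances:,}")
print(f"total words:     {total_words:,}")
```

The sums agree with the totals quoted for the table, so any error from the approximated numbers is already included in the 329,659,785 figure.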

The number one most popular word was "the" with 22,038,615 instances, or 6.7% of the database.  So about every 15th word we say or read is "the" (on average, counting only the 5k words in this list).
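
The "every 15th word" figure follows from dividing the total by the count for "the".  A minimal sketch, using only the two numbers quoted in this message (the variable names are illustrative):

```python
# Counts quoted in this message.
the_instances = 22_038_615
total_instances = 329_659_785  # instances of the top 5,000 words only

share = the_instances / total_instances
one_in = total_instances / the_instances

print(f'"the" is {share:.1%} of the counted instances')
print(f"that is about 1 word in {one_in:.0f}")
```

Since the denominator covers only the 5k most frequent words, the share of "the" across the whole corpus would be somewhat lower.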


Tom Zurinskas, USA - CT20, TN3, NJ33, FL7+
see truespel.com phonetic spelling





> ---------------------- Information from the mail header -----------------------
> Sender: American Dialect Society
> Poster: "Mullins, Bill AMRDEC"
> Subject: Re: COCA Corpus of Contemporary American English (UNCLASSIFIED)
> -------------------------------------------------------------------------------
>
> Classification: UNCLASSIFIED
> Caveats: NONE
>
>
>
>> -----Original Message-----
>> From: American Dialect Society [mailto:ADS-L at LISTSERV.UGA.EDU] On
>> Behalf Of Tom Zurinskas
>> Sent: Monday, February 22, 2010 4:06 PM
>> To: ADS-L at LISTSERV.UGA.EDU
>> Subject: COCA Corpus of Contemporary American English
>>
>> ---------------------- Information from the mail header -----------------------
>> Sender: American Dialect Society
>> Poster: Tom Zurinskas
>> Subject: COCA Corpus of Contemporary American English
>> -------------------------------------------------------------------------------
>>
>> Thanks for the COCA site edress
>>
>> http://www.americancorpus.org/
>>
>> This site has a word frequency list of the most popular 5k words in
>> English. I wonder where the words are from - newspapers? internet?
>>
>
>
> The home page for the site says right there: "The corpus contains more
> than 400 million words of text and is equally divided among spoken,
> fiction, popular magazines, newspapers, and academic texts."
>
> ------------------------------------------------------------
> The American Dialect Society - http://www.americandialect.org
