[Corpora-List] Frequency of the pronoun I

Burger, John D. john at mitre.org
Thu Sep 15 18:55:24 UTC 2011


Mike Scott wrote:

> On page 45 of the 3 September issue of New Scientist, there is a table 
> giving frequencies of "the 20 most frequently used words in the English 
> language, across both spoken and written texts". The first is I, then 
> THE, AND, TO, A, OF, THAT... ME,ON,BUT.
> I wrote to the author, James Pennebaker of the U of Texas, about this, 
> expressing my surprise at the pronoun I having greater frequency than 
> THE, as even in the spoken-only section of the BNC (10m words) we find I 
> occurring only just over half as often as THE. His data contains a mix 
> of spoken and written with a large amount of blog data. He reports that 
> with all his studies in the USA and Mexico, "people always use more I 
> more than THE.  It's never close."
> Can anyone help here, clearing up the position? Someone with access to a 
> really top quality corpus, more up to date and representative than the BNC?

Representative of what?  If the question is "what is the most frequent word in the English language", the next question should be what is meant by "the English language".  (Also, what is meant by "word", but let's not go there.)  To me, the obvious extensional definition of "the English language" is all the English utterances produced in some time period, whether written or spoken.  The vast majority of these are clearly spoken, even for recent time frames: many people go days without writing anything at all.  I'm not certain, but I wouldn't be surprised if a truly "representative" corpus were more than 99% speech.  (And thus all of the corpora discussed in this thread, including Pennebaker's, are completely unrepresentative.)  Given this, I'm not terribly surprised that "the most frequent word in the English language" is "I".
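
To make the dependence on the speech/writing mix concrete, here is a minimal sketch of how a word's pooled relative frequency shifts with the share of speech in a corpus. The per-million rates below are invented for illustration only and are not measured from the BNC, Pennebaker's data, or any other corpus; the function names are likewise hypothetical.

```python
def pooled_rate(speech_rate, writing_rate, speech_share):
    """Relative frequency of a word in a corpus whose tokens are
    `speech_share` speech and the remainder writing."""
    if not 0.0 <= speech_share <= 1.0:
        raise ValueError("speech_share must be between 0 and 1")
    return speech_share * speech_rate + (1.0 - speech_share) * writing_rate


# Invented per-million-token rates, for illustration only.
# They are NOT taken from any real corpus.
RATES = {
    "I": {"speech": 40000.0, "writing": 8000.0},
    "the": {"speech": 30000.0, "writing": 60000.0},
}


def most_frequent(speech_share, rates=RATES):
    """Return the word with the highest pooled rate at this speech share."""
    return max(
        rates,
        key=lambda w: pooled_rate(
            rates[w]["speech"], rates[w]["writing"], speech_share
        ),
    )


if __name__ == "__main__":
    # Compare a mostly written corpus, an even mix, and a mostly spoken one.
    for share in (0.1, 0.5, 0.99):
        print(share, most_frequent(share))
```

Whether "I" ever overtakes "the" depends entirely on the rates plugged in: if "the" is more frequent in both speech and writing, no mixing proportion changes the ranking.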

- John Burger
  MITRE