[Corpora-List] Frequency of the pronoun I

Mark Davies Mark_Davies at byu.edu
Tue Sep 13 15:05:18 UTC 2011


>> Can anyone help here, clearing up the position? Someone with access to a really top quality corpus, more up to date and representative than the BNC?

COCA (http://corpus.byu.edu/coca) is more recent (20 million words each year, 1990-2011) and larger (425 million words) than the BNC. As for representativeness, it is evenly divided among spoken, fiction, popular magazines, newspaper, and academic (about 85 million words each).

>> as even in the spoken-only section of the BNC (10m words) we find I occurring only just over half as often as THE. 

Same thing in COCA -- the frequency of "I" is less than half that of "the" in spoken (note that COCA spoken is perhaps more formal than BNC spoken):

     (spoken: 85 million words)
     the 4,183,469
     I    1,592,285
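
For anyone who wants to check the arithmetic, here is a minimal sketch that turns the two counts above into a ratio and into per-million rates. The section size is assumed to be exactly 85 million words (the figure above is only approximate), and the function names are made up for illustration.

```python
# Counts quoted above for the COCA spoken section.
counts = {"the": 4_183_469, "I": 1_592_285}

# Assumption: the section size is taken as exactly 85 million words,
# although the figure given above is only "about 85 million".
SECTION_WORDS = 85_000_000


def per_million(count, total=SECTION_WORDS):
    """Normalized frequency: occurrences per million words."""
    return count / total * 1_000_000


def ratio(word_a, word_b, table=counts):
    """Raw-count ratio of word_a to word_b within the same section."""
    return table[word_a] / table[word_b]


if __name__ == "__main__":
    for word, count in counts.items():
        print(f"{word}: {per_million(count):,.0f} per million words")
    print(f"I / the = {ratio('I', 'the'):.2f}")
```

With these counts the ratio comes out at roughly 0.38. Because both counts come from the same section, the ratio does not depend on the assumed section size; only the per-million rates do.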

Much more frequency data, including top 60,000 lemmas by genre and (40+) sub-genres: http://www.wordfrequency.info

Mark D.

============================================
Mark Davies
Professor of Linguistics / Brigham Young University
http://davies-linguistics.byu.edu

** Corpus design and use // Linguistic databases **
** Historical linguistics // Language variation **
** English, Spanish, and Portuguese **
============================================
________________________________________
From: corpora-bounces at uib.no [corpora-bounces at uib.no] On Behalf Of Mike Scott [mike at lexically.net]
Sent: Tuesday, September 13, 2011 8:33 AM
To: corpora at uib.no
Cc: jwpennebaker at gmail.com
Subject: [Corpora-List] Frequency of the pronoun I

On page 45 of the 3 September issue of New Scientist, there is a table
giving frequencies of "the 20 most frequently used words in the English
language, across both spoken and written texts". The first is I, then
THE, AND, TO, A, OF, THAT... ME, ON, BUT.
I wrote to the author, James Pennebaker of the U of Texas, about this,
expressing my surprise at the pronoun I having greater frequency than
THE, as even in the spoken-only section of the BNC (10m words) we find I
occurring only just over half as often as THE. His data contains a mix
of spoken and written with a large amount of blog data. He reports that
with all his studies in the USA and Mexico, "people always use more I
more than THE.  It's never close."
Can anyone help here, clearing up the position? Someone with access to a
really top quality corpus, more up to date and representative than the BNC?

Mike

--
Mike Scott

***
If you publish research which uses WordSmith, do let me know so I can include it at
http://www.lexically.net/wordsmith/corpus_linguistics_links/papers_using_wordsmith.htm
***
University of Aston and Lexical Analysis Software Ltd.
mike.scott at aston.ac.uk
www.lexically.net


_______________________________________________
UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora


