[Corpora-List] Frequency of the pronoun I

Tue Sep 13 15:14:44 UTC 2011

Unsurprisingly, in a corpus of ~ six million words from 320 plays of
Shakespeare's generation (broadly speaking) forms of 'be' and 'I' dominate
(with respectively 245,500 and 239,200 occurrences, followed at some
distance by 'the' (182,400) and 'and' (180,00)).  Everything does indeed
depend on the text type.

From:  Adam Kilgarriff <adam at lexmasterclass.com>
Date:  Tue, 13 Sep 2011 15:51:06 +0100
To:  Mike Scott <mike at lexically.net>
Cc:  <corpora at uib.no>, <jwpennebaker at gmail.com>
Subject:  Re: [Corpora-List] Frequency of the pronoun I

Everything depends on text type.

BNC-spoken overall has more 'the' than 'I' but that's because half of it is
meetings/lectures/sermons.  If you look only at the conversational part
(obscurely called "demographic") 'I' is more common, in keeping with the
kinds of language that James Pennebaker works with (from my recollection of
a fascinating talk of his I went to).
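
A minimal sketch of the comparison described above: count 'I' and 'the'
separately for each text type and normalise per million tokens, rather
than pooling everything into one figure. The function name, the crude
regex tokeniser and the two toy samples are assumptions made purely for
illustration; they are not BNC data and not anyone's actual method.

```python
import re
from collections import Counter


def per_million(text, targets=("i", "the")):
    """Return occurrences per million tokens for each target word.

    Tokenisation is deliberately crude (lower-cased runs of letters and
    apostrophes), so contracted forms such as "I'm" are counted as their
    own token and not as an occurrence of "i".
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    total = len(tokens)
    if total == 0:
        return {word: 0.0 for word in targets}
    counts = Counter(tokens)
    return {word: counts[word] * 1_000_000 / total for word in targets}


# Hypothetical toy samples standing in for two text types.
samples = {
    "conversation": "I think I saw it and I said I would go",
    "lecture": "the results of the study show the effect of the treatment",
}

# Report the rates per text type instead of one pooled ranking.
rates = {name: per_million(text) for name, text in samples.items()}
for name, rate in rates.items():
    print(name, {word: round(value) for word, value in rate.items()})
```

Run on real sub-corpora (for example the conversational and the
meetings/lectures parts kept separate), the same per-type breakdown would
show whether the ranking of 'I' and 'the' flips with text type.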

Asking for a more representative corpus won't help because we all have
different ideas about what it should be representative of.

Adam

On 13 September 2011 15:33, Mike Scott <mike at lexically.net> wrote:
> On page 45 of the 3 September issue of New Scientist, there is a table giving
> frequencies of "the 20 most frequently used words in the English language,
> across both spoken and written texts". The first is I, then THE, AND, TO, A,
> OF, THAT... ME,ON,BUT.
> I wrote to the author, James Pennebaker of the U of Texas, about this,
> expressing my surprise at the pronoun I having greater frequency than THE, as
> even in the spoken-only section of the BNC (10m words) we find I occurring
> only just over half as often as THE. His data contains a mix of spoken and
> written with a large amount of blog data. He reports that with all his studies
> in the USA and Mexico, "people always use more I more than THE.  It's never
> close."
> Can anyone help here, clearing up the position? Someone with access to a
> really top quality corpus, more up to date and representative than the BNC?
> 
> Mike
> 
> -- 
> Mike Scott
> 
> ***
> If you publish research which uses WordSmith, do let me know so I can include
> it at
> http://www.lexically.net/wordsmith/corpus_linguistics_links/papers_using_wordsmith.htm
> ***
> University of Aston and Lexical Analysis Software Ltd.
> mike.scott at aston.ac.uk
> www.lexically.net <http://www.lexically.net>
> 
> 
> _______________________________________________
> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
> <http://mailman.uib.no/options/corpora>
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
> <http://mailman.uib.no/listinfo/corpora>

-- 
========================================
Adam Kilgarriff <http://www.kilgarriff.co.uk/>
adam at lexmasterclass.com
Director                                    Lexical Computing Ltd
<http://www.sketchengine.co.uk/>
Visiting Research Fellow                 University of Leeds
<http://leeds.ac.uk>
Corpora for all with the Sketch Engine <http://www.sketchengine.co.uk>
                        DANTE: a lexical database for English
<http://www.webdante.com>
========================================
