[Corpora-List] Frequency of the pronoun I

Tue Sep 13 15:33:31 UTC 2011

Maybe we can turn the question around and use the "I"/"the" ratio (occurrences
of "I" per occurrence of "the") as an index of how socially vs. descriptively
oriented a corpus is? Here is a summary of the data I have at hand. Marc

Source                               the           I   ratio (I/the)
COCA (academic)                  5549547      204916    0.04
COCA (newspapers)                4648992      506030    0.11
Google (books)               22914473646  2744649681    0.12
COCA (magazines)                 4878925      648344    0.13
American blogs                   4200000     1300000    0.31
COCA (fiction)                   4534433     1576303    0.35
COCA (television programs)       4190341     1623705    0.39
Shakespearean plays               182400      239200    1.31
SUBTLEX (film subtitles)         1501908     2038529    1.36
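
The ratio column in the table above can be reproduced from the two count columns. The following is a minimal sketch, not Marc's own script: it assumes the ratio is the count of "I" divided by the count of "the" (which matches every row), and the names COUNTS, i_the_ratio and ranked_ratios are made up for illustration.

```python
# Counts of "the" and "I" copied from the table above: source -> (the, I).
COUNTS = {
    "COCA (academic)": (5549547, 204916),
    "COCA (newspapers)": (4648992, 506030),
    "Google (books)": (22914473646, 2744649681),
    "COCA (magazines)": (4878925, 648344),
    "American blogs": (4200000, 1300000),
    "COCA (fiction)": (4534433, 1576303),
    "COCA (television programs)": (4190341, 1623705),
    "Shakespearean plays": (182400, 239200),
    "SUBTLEX (film subtitles)": (1501908, 2038529),
}


def i_the_ratio(the_count, i_count):
    """Occurrences of 'I' per occurrence of 'the' (assumed definition)."""
    return i_count / the_count


def ranked_ratios(counts):
    """Return (source, ratio rounded to 2 places), sorted from low to high."""
    rows = [(src, i_the_ratio(the, i)) for src, (the, i) in counts.items()]
    rows.sort(key=lambda row: row[1])
    return [(src, round(ratio, 2)) for src, ratio in rows]


if __name__ == "__main__":
    for source, ratio in ranked_ratios(COUNTS):
        print(f"{source:<28}{ratio:>6.2f}")
```

A ratio above 1 means "I" outnumbers "the" in that source, which in this data happens only for the Shakespearean plays and the film subtitles.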

From: corpora-bounces at uib.no [mailto:corpora-bounces at uib.no] On Behalf Of
Martin Mueller
Sent: Tuesday 13 September 2011 17:15
To: corpora at uib.no
Cc: jwpennebaker at gmail.com
Subject: Re: [Corpora-List] Frequency of the pronoun I

Unsurprisingly, in a corpus of ~ six million words from 320 plays of
Shakespeare's generation (broadly speaking), forms of 'be' and 'I' dominate
(with 245,500 and 239,200 occurrences respectively), followed at some
distance by 'the' (182,400) and 'and' (180,00).  Everything does indeed
depend on the text type.

From: Adam Kilgarriff <adam at lexmasterclass.com>
Date: Tue, 13 Sep 2011 15:51:06 +0100
To: Mike Scott <mike at lexically.net>
Cc: <corpora at uib.no>, <jwpennebaker at gmail.com>
Subject: Re: [Corpora-List] Frequency of the pronoun I

Everything depends on text type.

BNC-spoken overall has more 'the' than 'I' but that's because half of it is
meetings/lectures/sermons.  If you look only at the conversational part
(obscurely called "demographic") 'I' is more common, in keeping with the
kinds of language that James Pennebaker works with (from my recollection of
a fascinating talk of his I went to).

Asking for a more representative corpus won't help because we all have
different ideas about what it should be representative of.

Adam

On 13 September 2011 15:33, Mike Scott <mike at lexically.net> wrote:

On page 45 of the 3 September issue of New Scientist, there is a table
giving frequencies of "the 20 most frequently used words in the English
language, across both spoken and written texts". The first is I, then THE,
AND, TO, A, OF, THAT... ME, ON, BUT.
I wrote to the author, James Pennebaker of the U of Texas, about this,
expressing my surprise at the pronoun I having greater frequency than THE,
as even in the spoken-only section of the BNC (10m words) we find I
occurring only just over half as often as THE. His data contains a mix of
spoken and written with a large amount of blog data. He reports that with
all his studies in the USA and Mexico, "people always use more I more than
THE.  It's never close."
Can anyone help here, clearing up the position? Someone with access to a
really top quality corpus, more up to date and representative than the BNC?
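
For anyone who wants to check the I vs. THE comparison on their own texts, here is a rough sketch of the counting step. It is not how the BNC, WordSmith or Pennebaker's software count words; the function name count_i_and_the and the tokenisation rule (runs of ASCII letters, so "I'm" contributes one "I") are assumptions made for illustration. Choices like this shift the totals, so counts are only comparable when the same rule is used on both corpora.

```python
import re


def count_i_and_the(text):
    """Return (count of 'I', count of 'the') for a plain-text sample.

    Assumptions: tokens are runs of ASCII letters, so "I'm" is split into
    "I" and "m"; 'I' is matched case-sensitively (a capital letter standing
    alone, which also catches the Roman numeral I); 'the' is matched
    case-insensitively.
    """
    tokens = re.findall(r"[A-Za-z]+", text)
    i_count = sum(1 for tok in tokens if tok == "I")
    the_count = sum(1 for tok in tokens if tok.lower() == "the")
    return i_count, the_count


if __name__ == "__main__":
    sample = "I think the answer is that I'm not sure. The data vary."
    i_count, the_count = count_i_and_the(sample)
    print("I:", i_count, "the:", the_count)
```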

Mike

-- 
Mike Scott

***
If you publish research which uses WordSmith, do let me know so I can
include it at
http://www.lexically.net/wordsmith/corpus_linguistics_links/papers_using_wordsmith.htm
***
University of Aston and Lexical Analysis Software Ltd.
mike.scott at aston.ac.uk
www.lexically.net

_______________________________________________
UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora

-- 
========================================
Adam Kilgarriff <http://www.kilgarriff.co.uk/>
adam at lexmasterclass.com
Director, Lexical Computing Ltd <http://www.sketchengine.co.uk/>
Visiting Research Fellow, University of Leeds <http://leeds.ac.uk>

Corpora for all with the Sketch Engine <http://www.sketchengine.co.uk>
DANTE: a lexical database for English <http://www.webdante.com>
========================================
