[Corpora-List] Frequency of the pronoun I
Marc Brysbaert
marc.brysbaert at ugent.be
Tue Sep 13 15:33:31 UTC 2011
Maybe we can turn the question around and use the "I/the" ratio as an index
of how socially oriented vs. description-oriented a corpus is? Here is a
summary of the data I have at hand. Marc
Source                        the            I             ratio (I/the)
COCA (academic)               5549547        204916        0.04
COCA (newspapers)             4648992        506030        0.11
Google (books)                22914473646    2744649681    0.12
COCA (magazines)              4878925        648344        0.13
American blogs                4200000        1300000       0.31
COCA (fiction)                4534433        1576303       0.35
COCA (television programs)    4190341        1623705       0.39
Shakespearean plays           182400         239200        1.31
SUBTLEX (film subtitles)      1501908        2038529       1.36
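
The ratio column can be recomputed from the raw counts: each value is the count of "I" divided by the count of "the". A minimal sketch, assuming the counts are copied by hand from the table; the names COUNTS and i_the_ratio are made up for illustration and do not come from any corpus tool:

```python
# Counts copied from the table: (source, count of "the", count of "I").
COUNTS = [
    ("COCA (academic)", 5549547, 204916),
    ("COCA (newspapers)", 4648992, 506030),
    ("Google (books)", 22914473646, 2744649681),
    ("COCA (magazines)", 4878925, 648344),
    ("American blogs", 4200000, 1300000),
    ("COCA (fiction)", 4534433, 1576303),
    ("COCA (television programs)", 4190341, 1623705),
    ("Shakespearean plays", 182400, 239200),
    ("SUBTLEX (film subtitles)", 1501908, 2038529),
]


def i_the_ratio(the_count, i_count):
    """Return occurrences of "I" per occurrence of "the"."""
    return i_count / the_count


# Print the sources from the most description-oriented (lowest ratio)
# to the most socially oriented (highest ratio).
for source, the_count, i_count in sorted(
    COUNTS, key=lambda row: i_the_ratio(row[1], row[2])
):
    print(f"{source:<28} {i_the_ratio(the_count, i_count):.2f}")
```

A ratio above 1 means "I" is more frequent than "the"; in this data that holds only for the plays and the film subtitles.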
From: corpora-bounces at uib.no [mailto:corpora-bounces at uib.no] On Behalf Of
Martin Mueller
Sent: Tuesday 13 September 2011 17:15
To: corpora at uib.no
Cc: jwpennebaker at gmail.com
Subject: Re: [Corpora-List] Frequency of the pronoun I
Unsurprisingly, in a corpus of ~ six million words from 320 plays of
Shakespeare's generation (broadly speaking) forms of 'be' and 'I' dominate
(with respectively 245,500 and 239,200 occurrences), followed at some
distance by 'the' (182,400) and 'and' (180,00). Everything does indeed
depend on the text type.
From: Adam Kilgarriff <adam at lexmasterclass.com>
Date: Tue, 13 Sep 2011 15:51:06 +0100
To: Mike Scott <mike at lexically.net>
Cc: <corpora at uib.no>, <jwpennebaker at gmail.com>
Subject: Re: [Corpora-List] Frequency of the pronoun I
Everything depends on text type.
BNC-spoken overall has more 'the' than 'I' but that's because half of it is
meetings/lectures/sermons. If you look only at the conversational part
(obscurely called "demographic") 'I' is more common, in keeping with the
kinds of language that James Pennebaker works with (from my recollection of
a fascinating talk of his I went to).
Asking for a more representative corpus won't help because we all have
different ideas about what it should be representative of.
Adam
On 13 September 2011 15:33, Mike Scott <mike at lexically.net> wrote:
On page 45 of the 3 September issue of New Scientist, there is a table
giving frequencies of "the 20 most frequently used words in the English
language, across both spoken and written texts". The first is I, then THE,
AND, TO, A, OF, THAT... ME,ON,BUT.
I wrote to the author, James Pennebaker of the U of Texas, about this,
expressing my surprise at the pronoun I having greater frequency than THE,
as even in the spoken-only section of the BNC (10m words) we find I
occurring only just over half as often as THE. His data contains a mix of
spoken and written with a large amount of blog data. He reports that with
all his studies in the USA and Mexico, "people always use more I more than
THE. It's never close."
Can anyone help here, clearing up the position? Someone with access to a
really top quality corpus, more up to date and representative than the BNC?
Mike
--
Mike Scott
***
If you publish research which uses WordSmith, do let me know so I can
include it at
http://www.lexically.net/wordsmith/corpus_linguistics_links/papers_using_wordsmith.htm
***
University of Aston and Lexical Analysis Software Ltd.
mike.scott at aston.ac.uk
www.lexically.net
_______________________________________________
UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora
--
========================================
Adam Kilgarriff <http://www.kilgarriff.co.uk/>
adam at lexmasterclass.com
Director Lexical Computing Ltd
<http://www.sketchengine.co.uk/>
Visiting Research Fellow University of Leeds
<http://leeds.ac.uk>
Corpora for all with the Sketch Engine <http://www.sketchengine.co.uk>
DANTE: <http://www.webdante.com> a lexical database
for English
========================================