[Corpora-List] Frequency of the pronoun I

Rich Cooper rich at englishlogickernel.com
Tue Sep 13 18:19:44 UTC 2011


Using "the/I" can lead to infinite values in
corpora (scientific lit, patents) that never use
the pronoun "I".  It might be better practice to
use the inverse, i.e. the "I/the" ratio, which
would be 0.0 for such corpora.  Perhaps there are
languages (Russian?) in which the pronoun would
never be used anywhere, but in English the measure
seems well chosen.  
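
A minimal sketch of that suggestion, assuming a naive regex tokenizer and a hypothetical helper name `i_the_ratio` (neither comes from this thread). It reports I/the, which is 0.0 for a text that never uses "I", and it returns None instead of dividing by zero when "the" is absent.

```python
import re
from collections import Counter


def i_the_ratio(text):
    """Return count('I') / count('the'), or None if 'the' never occurs.

    Assumption: tokens are runs of letters and apostrophes; only the
    bare uppercase token "I" counts as the pronoun, while "the" is
    matched case-insensitively.
    """
    counts = Counter()
    for token in re.findall(r"[A-Za-z']+", text):
        if token == "I":
            counts["I"] += 1
        elif token.lower() == "the":
            counts["the"] += 1
    if counts["the"] == 0:
        return None  # ratio undefined: no occurrence of "the"
    return counts["I"] / counts["the"]


print(i_the_ratio("The method heats the sample."))
print(i_the_ratio("I think the cat saw the dog and I ran."))
```

Contractions such as "I'm" are not counted as "I" in this sketch; whether they should be is a tokenization choice that would need to be fixed before comparing corpora.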

 

It is striking how clearly your figures indicate
how well that single measure works as an
indication of corpus character.  Thanks for a
useful metric.  It might even be used to identify
a significant measure of subjectivity in the
corpus.  

 

-Rich

 

Sincerely,

Rich Cooper

EnglishLogicKernel.com

Rich AT EnglishLogicKernel DOT com

9 4 9 \ 5 2 5 - 5 7 1 2

  _____  

From: corpora-bounces at uib.no
[mailto:corpora-bounces at uib.no] On Behalf Of Marc
Brysbaert
Sent: Tuesday, September 13, 2011 8:34 AM
To: corpora at uib.no
Subject: Re: [Corpora-List] Frequency of the
pronoun I

 

Maybe we can turn the question around and use the
"the/I" ratio as an index of how socially vs.
description oriented a corpus is? Here is a
summary of the data I have at hand. Marc

 


Source                               the            I  ratio (I/the)
COCA (academic)                  5549547       204916  0.04
COCA (newspapers)                4648992       506030  0.11
Google (books)               22914473646   2744649681  0.12
COCA (magazines)                 4878925       648344  0.13
American blogs                   4200000      1300000  0.31
COCA (fiction)                   4534433      1576303  0.35
COCA (television programs)       4190341      1623705  0.39
Shakespearean plays               182400       239200  1.31
SUBTLEX (film subtitles)         1501908      2038529  1.36
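
As a quick check, the ratio column above can be recomputed from the two count columns. The sketch below copies the counts from the table (the variable names are illustrative only) and shows that the reported values correspond to the count of "I" divided by the count of "the".

```python
# (count of "the", count of "I") per source, copied from the table above.
counts = {
    "COCA (academic)": (5549547, 204916),
    "COCA (newspapers)": (4648992, 506030),
    "Google (books)": (22914473646, 2744649681),
    "COCA (magazines)": (4878925, 648344),
    "American blogs": (4200000, 1300000),
    "COCA (fiction)": (4534433, 1576303),
    "COCA (television programs)": (4190341, 1623705),
    "Shakespearean plays": (182400, 239200),
    "SUBTLEX (film subtitles)": (1501908, 2038529),
}

# Ratio of "I" to "the", rounded to two decimals as in the table.
ratios = {
    source: round(i_count / the_count, 2)
    for source, (the_count, i_count) in counts.items()
}

for source, ratio in ratios.items():
    print(f"{source:28s} {ratio:.2f}")
```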

 

 

From: corpora-bounces at uib.no
[mailto:corpora-bounces at uib.no] On Behalf Of
Martin Mueller
Sent: Tuesday 13 September 2011 17:15
To: corpora at uib.no
Cc: jwpennebaker at gmail.com
Subject: Re: [Corpora-List] Frequency of the
pronoun I

 

Unsurprisingly, in a corpus of ~ six million words
from 320 plays of Shakespeare's generation
(broadly speaking) forms of 'be' and 'I' dominate
(with respectively 245,500 and 239,200
occurrences), followed at some distance by 'the'
(182,400) and 'and' (180,00).  Everything does
indeed depend on the text type. 

 

From: Adam Kilgarriff <adam at lexmasterclass.com>
Date: Tue, 13 Sep 2011 15:51:06 +0100
To: Mike Scott <mike at lexically.net>
Cc: <corpora at uib.no>, <jwpennebaker at gmail.com>
Subject: Re: [Corpora-List] Frequency of the
pronoun I

 

Everything depends on text type.

BNC-spoken overall has more 'the' than 'I' but
that's because half of it is
meetings/lectures/sermons.  If you look only at
the conversational part (obscurely called
"demographic") 'I' is more common, in keeping with
the kinds of language that James Pennebaker works
with (from my recollection of a fascinating talk
of his I went to)

 

Asking for a more representative corpus won't help
because we all have different ideas about what it
should be representative of

 

Adam

 

On 13 September 2011 15:33, Mike Scott
<mike at lexically.net> wrote:

On page 45 of the 3 September issue of New
Scientist, there is a table giving frequencies of
"the 20 most frequently used words in the English
language, across both spoken and written texts".
The first is I, then THE, AND, TO, A, OF, THAT...
ME, ON, BUT.
I wrote to the author, James Pennebaker of the U
of Texas, about this, expressing my surprise at
the pronoun I having greater frequency than THE,
as even in the spoken-only section of the BNC (10m
words) we find I occurring only just over half as
often as THE. His data contains a mix of spoken
and written with a large amount of blog data. He
reports that with all his studies in the USA and
Mexico, "people always use more I more than THE.
It's never close."
Can anyone help here, clearing up the position?
Someone with access to a really top quality
corpus, more up to date and representative than
the BNC?

Mike

-- 
Mike Scott

***
If you publish research which uses WordSmith, do
let me know so I can include it at
http://www.lexically.net/wordsmith/corpus_linguistics_links/papers_using_wordsmith.htm
***
University of Aston and Lexical Analysis Software
Ltd.
mike.scott at aston.ac.uk
www.lexically.net


_______________________________________________
UNSUBSCRIBE from this page:
http://mailman.uib.no/options/corpora
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora





 

-- 
========================================
Adam Kilgarriff <http://www.kilgarriff.co.uk/>
adam at lexmasterclass.com

Director
Lexical Computing Ltd
<http://www.sketchengine.co.uk/>                 
Visiting Research Fellow
University of Leeds <http://leeds.ac.uk>      

Corpora for all with the Sketch Engine
<http://www.sketchengine.co.uk>                  

                        DANTE:
<http://www.webdante.com>  a lexical database for
English                  

========================================

 
