[Corpora-List] Frequency of the pronoun I

Thu Sep 15 17:28:58 UTC 2011

Dear Mike,

True, but aren't those kinds of corpora very rare?
Most corpora will have "the" somewhere, and only
lists (as your example) and other extreme outlier
corpora will not have it.  

One alternative that was suggested earlier,
(1+count(I))/(1+count(the)), would avoid the divide
by zero for those corpora, but I prefer the
absolute count X to the adjusted one (1+X), since
its range doesn't vary with the total number of
words in a corpus.  Small corpora (e.g. a single
patent claim sentence of 30 words) could be skewed
arbitrarily much by the 1+ solution.
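
To make the small-corpus concern concrete, here is a minimal Python
sketch that compares the plain I/the ratio with the add-one version.
The function names and the naive whitespace tokenizer are assumptions
made for illustration only; nobody in this thread proposed them.

```python
from collections import Counter


def tokenize(text):
    # Naive assumption: lowercase, split on whitespace, strip punctuation.
    return [w.strip(".,;:!?\"'()") for w in text.lower().split()]


def plain_ratio(tokens, num="i", den="the"):
    # count(num) / count(den); None when the denominator never occurs.
    counts = Counter(tokens)
    if counts[den] == 0:
        return None
    return counts[num] / counts[den]


def add_one_ratio(tokens, num="i", den="the"):
    # (1 + count(num)) / (1 + count(den)); always defined.
    counts = Counter(tokens)
    return (1 + counts[num]) / (1 + counts[den])


if __name__ == "__main__":
    small = tokenize("A device comprising the housing and the lid")
    large = ["the"] * 2000 + ["device"] * 8000  # synthetic, no "i" at all
    for name, toks in (("small", small), ("large", large)):
        print(name, plain_ratio(toks), add_one_ratio(toks))
```

Neither sample contains "I", so the plain ratio is 0.0 for both, while
the add-one value is 1/3 for the eight-word sample and 1/2001 for the
large one: the same underlying usage yields very different numbers
depending only on corpus size.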

Perhaps a more multidimensional solution,
something like (I + you + your + yours + we + our
+ ours + they + their + theirs + he + his + him +
she + hers + her) / (a + the), might also be
useful if we take some samples of various corpus
types and look for clusters in the plane of those
metrics.

Even the objective (it + its) ratio with the other
pronouns might be a useful predictor.  
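
As a rough sketch of how those two metrics could be computed per
corpus, the Python below returns one point in that plane.  The word
sets are copied from the lists above; the function name, the choice
of (it + its) / (other pronouns) as the second axis, and returning
None for an empty denominator are my own assumptions.

```python
from collections import Counter

# Word sets taken from the lists in this message.
PERSONAL = {"i", "you", "your", "yours", "we", "our", "ours", "they",
            "their", "theirs", "he", "his", "him", "she", "hers", "her"}
ARTICLES = {"a", "the"}
NEUTER = {"it", "its"}


def pronoun_features(tokens):
    # Returns (personal / articles, neuter / personal) for one corpus.
    # A coordinate is None when its denominator count is zero.
    counts = Counter(t.lower() for t in tokens)
    personal = sum(counts[w] for w in PERSONAL)
    articles = sum(counts[w] for w in ARTICLES)
    neuter = sum(counts[w] for w in NEUTER)
    x = personal / articles if articles else None
    y = neuter / personal if personal else None
    return x, y


if __name__ == "__main__":
    sample = "I gave her the book and she put it on a shelf".split()
    print(pronoun_features(sample))
```

Computing one such point per corpus sample would give the scatter in
which clusters by corpus type could then be looked for.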

-Rich

Sincerely,

Rich Cooper

EnglishLogicKernel.com

Rich AT EnglishLogicKernel DOT com

9 4 9 \ 5 2 5 - 5 7 1 2

  _____  

From: corpora-bounces at uib.no
[mailto:corpora-bounces at uib.no] On Behalf Of Mike
Scott
Sent: Thursday, September 15, 2011 12:43 AM
To: corpora at uib.no
Subject: Re: [Corpora-List] Frequency of the
pronoun I

There are also English texts without THE (lists of
products, election results etc.) so the
computation either way would need to avoid
dividing by zero...

What a useful discussion. Clarified a particularly
cluttered and dusty corner of my own thinking.

Cheers -- Mike

On 13/09/2011 19:19, Rich Cooper wrote: 

Using "the/I" can lead to infinite values in
corpora (scientific lit, patents) that never use
the pronoun "I".  It might be better practice to
use the inverse, i.e. the "I/the" ration, which
would be 0.0 for such corpora.  

-- 
Mike Scott

***
If you publish research which uses WordSmith, do
let me know so I can include it at
http://www.lexically.net/wordsmith/corpus_linguistics_links/papers_using_wordsmith.htm
***
University of Aston and Lexical Analysis Software
Ltd.
mike.scott at aston.ac.uk
www.lexically.net