[Corpora-List] Frequency of the pronoun I

Ken Litkowski ken at clres.com
Wed Sep 14 16:00:47 UTC 2011


This discussion has focused on only one aspect of James Pennebaker's 
work, the 'I' frequency, and perhaps not as much on his many 
contributions to content analysis, which may have even more relevance to 
discussions on this list.

Kyle Dent of Xerox has recently performed an analysis 
<http://www.parc.com/content/attachments/through-twitter-glass.pdf> of 
2400 tweets, with the aim of classifying them into "Questions" and "Not 
Questions". He developed an elaborate NLP system to deal with these 
tweets. He kindly provided me with these data, so that I could examine 
them with my content analysis program to see how well they could be 
analyzed without all the NLP superstructure. I happened to run a first 
analysis at the time of this thread. It simply compares the two sets as 
a whole.

The corpus size is 31,000 words (hardly the stature of BNC, COCA, or 
OEC). But, curiously, "I" and "the" hold the top two frequency 
positions in both sets, in opposite order:

Set                "the"    "I"
Questions            400    327
Not Questions        437    575

Wow! Could this be a classification signature? Probably not on its own, 
but various other statistics generated by the program, taken in various 
combinations, may very well be. So, here we have a micro-genre analysis 
that confirms the other comments on this thread, much like the Known 
Similarity Corpora of Adam Kilgarriff (15 years ago!).
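
As a rough check on the table above, here is a minimal Python sketch that 
computes the "I"-to-"the" ratio for each set and a log-likelihood (G2) 
statistic for the 2x2 table. It is not part of my content analysis 
program; the function names are made up for illustration, and only the 
four counts come from the data. Because the word totals of the two sets 
are not given here, the test compares "I" against "the" within each set 
rather than against the set size.

```python
import math

# The four counts are copied from the table above; all names are illustrative.
counts = {
    "Questions": {"the": 400, "I": 327},
    "Not Questions": {"the": 437, "I": 575},
}


def i_to_the_ratio(row):
    """Return how many times "I" occurs for each occurrence of "the"."""
    return row["I"] / row["the"]


def g2(table):
    """Log-likelihood ratio (G2) statistic for a table of observed counts."""
    rows = list(table.values())
    cols = list(rows[0].keys())
    n = sum(sum(r.values()) for r in rows)
    col_tot = {c: sum(r[c] for r in rows) for c in cols}
    stat = 0.0
    for r in rows:
        row_tot = sum(r.values())
        for c in cols:
            # Expected count if set membership and word choice were independent.
            expected = row_tot * col_tot[c] / n
            if r[c] > 0:
                stat += r[c] * math.log(r[c] / expected)
    return 2 * stat


ratios = {name: i_to_the_ratio(row) for name, row in counts.items()}
g2_stat = g2(counts)

for name, ratio in ratios.items():
    print(f"{name}: I/the = {ratio:.2f}")
print(f"G2 = {g2_stat:.1f}")
```

With one degree of freedom, a G2 above 3.84 is the conventional 5% 
threshold. Treat any such result with caution: words within a tweet are 
not independent observations, so the statistic is at best a hint that 
the two sets differ, not a classifier.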

Sentiment analysis is an emerging field, but is currently dominated by 
heavy NLP techniques. I would suggest that techniques from content 
analysis might provide a nice complement.

     Ken
-- 
Ken Litkowski        TEL.: 301-482-0237
CL Research          EMAIL: ken at clres.com
9208 Gue Road        Home Page: http://www.clres.com
Damascus, MD 20872-1025 USA Blog: http://www.clres.com/blog