[Corpora-List] Average Daily Word Exposure

Wed Aug 18 14:12:32 UTC 2010

Hi all,

Thanks for the responses. I got a number of offline responses; here is a
collation of all the pointers I received.
While there doesn't seem to be one conclusive or even complete study,
piecing together various studies I think gives a pretty good estimate. See
the second blog post below for a fairly thorough (and sourced!) analysis.

====
Brett Reynolds:

All I can offer you is my own back-of-the-envelope calculation here:

<http://english-jack.blogspot.com/2007/07/idioms-interpreting-frequencies.html>

Best,
Brett

====

A comment on his post also led to this very interesting blog (with
relevant information):
http://learnalanguageortwo.blogspot.com/2009/06/alls-well-in-tv-land.html

====

There is also the Human Speechome Project
http://en.wikipedia.org/wiki/Human_Speechome_Project

and lastly the links provided by Marco below.

Cheers,
Ali

On Tue, Aug 17, 2010 at 12:49 PM, Marco Baroni <marco.baroni at unitn.it> wrote:

> Hi there.
>
> I asked a similar question a few years ago, without much success. I paste
> the summary below.
>
> If you find out more, please keep me posted!
>
> Regards,
>
> Marco
>
> Dear all,
>
> Two weeks ago I asked if somebody knew of work reporting estimates of how
> many words/sentences/etc. (adult) speakers of a language hear/write.
>
> I paste below the responses I got.
>
> Thanks a lot to all who responded!
>
> Regards,
>
> Marco
>
>
> ******************************************
> Reinhard Rapp
> ******************************************
>
> Dear Marco,
>
> I am also interested in the answer to your question. Some discussion
> can be found in a Psychological Review paper by Landauer & Dumais
> (1997) which is on the web at
>
> http://lsa.colorado.edu/papers/plato/plato.annote.html
>
> This is a citation from the most relevant part, which is footnote 6:
>
> ----------- start citation ------------
>
> From his log-normal model of word frequency distribution and the
> observations in Carroll et al. (1971), Carroll estimated a total
> vocabulary of 609,000 words in the
> universe of text to which students through highschool might be exposed.
> Dahl (1979), whose distribution function agrees with a different but
> smaller sample of Howes (1966), found 17,871 word types in 1,058,888 tokens
> of spoken American English, compared to 50,406 in the comparable sized
> adult sample of Kucera & Francis (1967). By Carroll's (1971) model, Dahl's
> data imply a total of roughly 150,000 word types in spoken English, thus
> approximately one-fourth the total, less to the extent that there are
> spoken words that do not appear in print. Moreover, the ratio of spoken to
> printed words to which a particular individual is exposed must be even more
> lopsided because local, ethnic and family usage undoubtedly restrict the
> variety of vocabulary more than published works intended for the general
> school-aged readership.
> If we assume that our seventh-grader has met a total of 50 million word
> tokens of spoken English (140 minutes a day at 100 words per minute for 10
> years) then the expected number of occasions on which she would have
> heard a spoken word of mean frequency would be about 370. Carroll's
> estimate for the total vocabulary of seventh grade texts is 280,000, and we
> estimate below that the typical student would have read about 3.8 million
> words of print. Thus, the mean number of times she would have seen a
> printed word to which she might be exposed is only about 14. The rest of
> the frequency distributions for heard and seen words, while not
> proportional, would, at every point, show that spoken words have already
> had much greater opportunity to be learned than printed words, so will
> profit much less from an additional occurrence.
>
> ----------- end citation ------------
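
As a quick check on the arithmetic in that footnote, here is a rough sketch
in Python. The 365-day year is my own assumption (the footnote does not say
how the days were counted), and the variable names are only illustrative.

```python
# Figures quoted from Landauer & Dumais (1997), footnote 6.
MINUTES_PER_DAY = 140        # minutes of speech heard per day
WORDS_PER_MINUTE = 100       # speaking rate
YEARS = 10
DAYS_PER_YEAR = 365          # assumption: not stated in the footnote

SPOKEN_TYPES = 150_000       # word types in spoken English (via Carroll's model)
PRINTED_TOKENS = 3_800_000   # words of print read by the typical student
PRINTED_TYPES = 280_000      # vocabulary of seventh-grade texts

# Total spoken tokens heard over the ten years.
spoken_tokens = MINUTES_PER_DAY * WORDS_PER_MINUTE * DAYS_PER_YEAR * YEARS

# Plain tokens-per-type ratios (simple division, not Carroll's model).
spoken_tokens_per_type = spoken_tokens / SPOKEN_TYPES
printed_tokens_per_type = PRINTED_TOKENS / PRINTED_TYPES

print(f"spoken tokens:         {spoken_tokens:,}")
print(f"spoken tokens / type:  {spoken_tokens_per_type:.0f}")
print(f"printed tokens / type: {printed_tokens_per_type:.0f}")
```

On these assumptions the spoken total comes to 51.1 million tokens, in line
with the footnote's rounded 50 million, and the printed ratio rounds to the
quoted 14. Simple division for speech gives roughly 341 rather than 370, so
that figure does not seem to come from a plain tokens/types ratio; I have
not tried to reproduce it.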
>
> ...
>
> With kind regards,
>
> Reinhard
>
>
>
> ******************************************
> Paula Newman
> ******************************************
>
> Marco,
> That's an interesting question. A little googling suggested that a lower
> bound might come from data on the average number of hours of TV watching
> per adult (multiplied by average words per minute on TV broadcasts).
> Paula
>
>
>
> ******************************************
> Paul Bennett
> ******************************************
>
>
> Geoffrey Pullum and Barbara Scholz (in Linguistic Review 19, 2002, p. 44)
> cite evidence that by the age of three a child in a professional household
> might have heard 30 million word tokens (but far fewer for children in
> other social classes). I know this relates to children rather than adults,
> but presumably the amount of language heard does not differ much by age.
>
> Their source is B. Hart and T. Risley: Meaningful Differences in the
> Everyday Experiences of Young Children (Paul H Brookes, 1995). I haven't
> read this, but I guess this would be a place to look for more information.
>
> Paul Bennett
>
>
>
> ******************************************
> Ilana Bromberg
> ******************************************
>
>
> Marco,
>
> There is some information regarding how much school-age children (up
> through HS I think) read in the following article.  It's possible that some
> of the sources they cite may have more information about adults.
>
> Landauer, Thomas K and Dumais, Susan T.  1997.  A Solution to Plato's
> Problem: The Latent Semantic Analysis Theory of the Acquisition, Induction,
> and Representation of Knowledge.  Psychological Review, 104:2, 211-240.
>
> Good luck,
> Ilana
>
>
> * Then, there was somebody who wanted to remain anonymous, who answered:
>
>
> I was interested in your query to the list, but had nothing scientific to
> offer. Nevertheless, for many years I have had to make estimates of how much
> of a person's experience of language is represented by a corpus of
> such-and-such a size.  It has been necessary to wow the public by suggesting
> that a query to EDIT scans several years of an individual's language
> experience, and, on the other hand, to convince sponsors that even half a
> billion words is just chickenfeed compared with the amount of text produced
> in a speech community.
>
> In EDIT 15 years ago we established a monitor corpus with 100mw of The
> Times, and discovered that the weekly output of that paper, including The
> Sunday Times, was over half a million words.  Genuine neologisms, and not
> just trivial variations or proper names, were coming in at around a dozen
> every day. But of course not even the most devoted reader gets through
> anything like the whole paper.
>
> Back when I was doing discourse analysis I read somewhere that speech is
> produced at an average of 1500 clauses an hour, and in speech, by my
> calculations at the time, a clause seemed to average 5 or 6 words.  I imagine
> that reading is not very different from that, maybe towards the faster end,
> but I haven't checked. Then you have to guess how many hours, on average,
> people are engaged in communicative activity, which I put at 12 hours.  1500
> x 6 x 12 gives an estimate of 108000 daily, 39420000 annually.
>
> If you are suspicious about any of the assumptions, you can just change
> them.
>
>
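
Since the anonymous estimate above invites changing the assumptions, here is
a minimal sketch in Python that makes them parameters. The function name and
the 365-day year are mine; the default values are the ones given in the
message (1500 clauses an hour, 6 words a clause, 12 hours a day).

```python
def daily_word_exposure(clauses_per_hour=1500, words_per_clause=6,
                        hours_per_day=12):
    """Estimated words encountered per day under the stated assumptions."""
    return clauses_per_hour * words_per_clause * hours_per_day


DAYS_PER_YEAR = 365  # assumption: matches the annual figure in the message

daily = daily_word_exposure()
annual = daily * DAYS_PER_YEAR

print(f"daily:  {daily:,}")    # daily:  108,000
print(f"annual: {annual:,}")   # annual: 39,420,000

# Changing an assumption, e.g. 5 words per clause instead of 6:
daily_low = daily_word_exposure(words_per_clause=5)
print(f"daily (5 words/clause): {daily_low:,}")   # 90,000
```

With the defaults this reproduces the 108000 daily and 39420000 annual
figures; any of the three inputs can be swapped for a better-sourced value.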

-- 
www.reseed.ca
www.pinkarmy.org

(•`'·.¸(`'·.¸(•)¸.·'´)¸.·'´•) .,.,