Hi all,<br><br>Thanks for the responses. I got a number of offline replies; here is a collation of all the pointers I received.<br>While there doesn't seem to be one conclusive or even complete study, piecing together the various studies gives, I think, a pretty good estimate. See the second blog post below for a fairly thorough <span style="color: rgb(153, 153, 153);">(and sourced!)</span> analysis.<br>
<br>====<br>Brett Reynolds:<br><br>All I can offer you is my own back-of-the-envelope calculation here:<br>
<br>
<<a href="http://english-jack.blogspot.com/2007/07/idioms-interpreting-frequencies.html" target="_blank">http://english-jack.blogspot.com/2007/07/idioms-interpreting-frequencies.html</a>><br>
<br>
Best,<br>
Brett<br><br>====<br><br>A comment on his post also led to this very interesting blog, which has relevant information:<br><a href="http://learnalanguageortwo.blogspot.com/2009/06/alls-well-in-tv-land.html">http://learnalanguageortwo.blogspot.com/2009/06/alls-well-in-tv-land.html</a><br>
<br>===<br><br>There is also the Human Speechome Project<br><a href="http://en.wikipedia.org/wiki/Human_Speechome_Project">http://en.wikipedia.org/wiki/Human_Speechome_Project</a><br><br>and lastly the links provided by Marco below.<br>
<br>Cheers,<br>Ali<br><br><div class="gmail_quote">On Tue, Aug 17, 2010 at 12:49 PM, Marco Baroni <span dir="ltr"><<a href="mailto:marco.baroni@unitn.it">marco.baroni@unitn.it</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;">
Hi there.<br>
<br>
I asked a similar question a few years ago, without much success. I paste the summary below.<br>
<br>
If you find out more, please keep me posted!<br>
<br>
Regards,<br>
<br>
Marco<br>
<br>
Dear all,<br>
<br>
Two weeks ago I asked if somebody knew of work reporting estimates of how<br>
many words/sentences/etc. (adult) speakers of a language hear/write.<br>
<br>
I paste below the responses I got.<br>
<br>
Thanks a lot to all who responded!<br>
<br>
Regards,<br>
<br>
Marco<br>
<br>
<br>
******************************************<br>
Reinhard Rapp<br>
******************************************<br>
<br>
Dear Marco,<br>
<br>
I am also interested in the answer to your question. Some discussion<br>
can be found in a Psychological Review paper by Landauer & Dumais<br>
(1997) which is on the web at<br>
<br>
<a href="http://lsa.colorado.edu/papers/plato/plato.annote.html" target="_blank">http://lsa.colorado.edu/papers/plato/plato.annote.html</a><br>
<br>
This is a citation from the most relevant part, which is footnote 6:<br>
<br>
----------- start citation ------------<br>
<br>
From his log-normal model of word frequency distribution and the<br>
observations in Carroll et al. (1971), Carroll estimated a total vocabulary of 609,000 words in the<br>
universe of text to which students through highschool might be exposed.<br>
Dahl (1979), whose distribution function agrees with a different but<br>
smaller sample of Howes (1966), found 17,871 word types in 1,058,888 tokens<br>
of spoken American English, compared to 50,406 in the comparable sized<br>
adult sample of Kucera & Francis (1967). By Carroll's (1971) model, Dahl's<br>
data imply a total of roughly 150,000 word types in spoken English, thus<br>
approximately one-fourth the total, less to the extent that there are<br>
spoken words that do not appear in print. Moreover, the ratio of spoken to<br>
printed words to which a particular individual is exposed must be even more<br>
lopsided because local, ethnic and family usage undoubtedly restrict the<br>
variety of vocabulary more than published works intended for the general<br>
school-aged readership.<br>
If we assume that our seventh-grader has met a total of 50 million word<br>
tokens of spoken English (140 minutes a day at 100 words per minute for 10<br>
years) then the expected number of occasions on which she would have<br>
heard a spoken word of mean frequency would be about 370. Carroll's<br>
estimate for the total vocabulary of seventh grade texts is 280,000, and we<br>
estimate below that the typical student would have read about 3.8 million<br>
words of print. Thus, the mean number of times she would have seen a<br>
printed word to which she might be exposed is only about 14. The rest of<br>
the frequency distributions for heard and seen words, while not<br>
proportional, would, at every point, show that spoken words have already<br>
had much greater opportunity to be learned than printed words, so will<br>
profit much less from an additional occurrence.<br>
<br>
----------- end citation ------------<br>
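<br>
The arithmetic in the footnote can be re-run with a short Python sketch. The variable names are illustrative and not from the paper, and the 365-day year is an assumption, since the footnote does not state one. With these inputs, plain division gives roughly 341 exposures per spoken word type rather than the quoted "about 370", so the authors presumably used somewhat different inputs or rounding; the printed-word figure does come out at about 14.<br>
<pre>
```python
# Figures quoted in the footnote above; names are illustrative only.
MINUTES_PER_DAY = 140
WORDS_PER_MINUTE = 100
YEARS = 10
DAYS_PER_YEAR = 365  # assumption: the footnote does not state the year length

# Spoken tokens heard by the seventh-grader ("a total of 50 million").
spoken_tokens = MINUTES_PER_DAY * WORDS_PER_MINUTE * DAYS_PER_YEAR * YEARS

SPOKEN_TYPES = 150_000      # "roughly 150,000 word types in spoken English"
PRINTED_TOKENS = 3_800_000  # "about 3.8 million words of print"
PRINTED_TYPES = 280_000     # vocabulary of seventh grade texts


def mean_exposures(tokens, types):
    """Average number of encounters per word type."""
    return tokens / types


print(spoken_tokens)
print(round(mean_exposures(spoken_tokens, SPOKEN_TYPES)))
print(round(mean_exposures(PRINTED_TOKENS, PRINTED_TYPES), 1))
```
</pre>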
<br>
...<br>
<br>
With kind regards,<br>
<br>
Reinhard<br>
<br>
<br>
<br>
******************************************<br>
Paula Newman<br>
******************************************<br>
<br>
Marco,<br>
That's an interesting question. A little googling suggested that a lower<br>
bound might come from data on the average number of hours of TV watching<br>
per adult (multiplied by average words per minute on TV broadcasts).<br>
Paula<br>
<br>
<br>
<br>
******************************************<br>
Paul Bennett<br>
******************************************<br>
<br>
<br>
Geoffrey Pullum and Barbara Scholz (in Linguistic Review 19, 2002, p44) cite<br>
evidence that by the age of three a child in a professional household might<br>
have heard 30 million word tokens (but far fewer for children in other social<br>
classes). I know this relates to children rather than adults, but presumably<br>
the amount of language heard does not differ much by age.<br>
<br>
Their source is B. Hart and T. Risley: Meaningful Differences in the Everyday<br>
Experiences of Young Children (Paul H Brookes, 1995). I haven't read this, but<br>
I guess this would be a place to look for more information.<br>
<br>
Paul Bennett<br>
<br>
<br>
<br>
******************************************<br>
Ilana Bromberg<br>
******************************************<br>
<br>
<br>
Marco,<br>
<br>
There is some information regarding how much school-age children (up<br>
through HS I think) read in the following article. It's possible that some<br>
of the sources they cite may have more information about adults.<br>
<br>
Landauer, Thomas K and Dumais, Susan T. 1997. A Solution to Plato's<br>
Problem: The Latent Semantic Analysis Theory of the Acquisition, Induction,<br>
and Representation of Knowledge. Psychological Review, 104:2, 211-240.<br>
<br>
Good luck,<br>
Ilana<br>
<br>
<br>
* Then there was somebody who wanted to remain anonymous, who answered:<br>
<br>
<br>
I was interested in your query to the list, but had nothing scientific to offer. Nevertheless, for many years I have had to make estimates of how much of a person's experience of language is represented by a corpus of such-and-such a size. It has been necessary to wow the public by suggesting that a query to EDIT scans several years of an individual's language experience, and, on the other hand, to convince sponsors that even half a billion words is just chickenfeed compared with the amount of text produced in a speech community.<br>
<br>
In EDIT 15 years ago we established a monitor corpus with 100mw of The Times, and discovered that the weekly output of that paper, including The Sunday Times, was over half a million words. Genuine neologisms, and not just trivial variations or proper names, were coming in at around a dozen every day. But of course not even the most devoted reader gets through anything like the whole paper.<br>
<br>
Back when I was doing discourse analysis I read somewhere that speech is produced at an average of 1500 clauses an hour, and in speech, by my calculations at the time, a clause seemed to average 5 to 6 words. I imagine that reading is not very different from that, maybe towards the faster end, but I haven't checked. Then you have to guess how many hours, on average, people are engaged in communicative activity, which I put at 12 hours. 1500 x 6 x 12 gives an estimate of 108,000 words daily, or 39,420,000 annually.<br>
<br>
If you are suspicious about any of the assumptions, you can just change them.<br>
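<br>
For anyone who does want to change the assumptions, the estimate above can be written as a small Python sketch. The function and parameter names are hypothetical, and the defaults are simply the figures from the message (with a 365-day year, which is what the annual total implies).<br>
<pre>
```python
def estimate_exposure(clauses_per_hour=1500, words_per_clause=6,
                      hours_per_day=12, days_per_year=365):
    """Return (daily, annual) word exposure under the given assumptions."""
    daily = clauses_per_hour * words_per_clause * hours_per_day
    return daily, daily * days_per_year


# Defaults reproduce the figures in the message: 108000 daily, 39420000 annually.
daily, annual = estimate_exposure()
print(daily, annual)

# Lower end of the "5 to 6 words per clause" range.
print(estimate_exposure(words_per_clause=5))
```
</pre>
Swapping in a different number of hours or a different speech rate only requires changing the corresponding argument.<br>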
<br>
</blockquote></div><br><br clear="all"><br>-- <br><a href="http://www.reseed.ca">www.reseed.ca</a><br><a href="http://www.pinkarmy.org">www.pinkarmy.org</a><br><br>(•`'·.¸(`'·.¸(•)¸.·'´)¸.·'´•) .,., <br>