[Corpora-List] Average Daily Word Exposure

David Wible wible at stringnet.org
Thu Aug 19 02:40:58 UTC 2010


Dear all,
This thread and the second blog link Brett included both remind me of a
related issue: exposure to textual input (reading). My interest in this has
a story that goes like this:

There is growing pressure in graduate programs, in countries and
universities where English is not the medium of education, for graduate
students to publish in international journals (in English), say in journals
listed in SSCI, SCI, A&HCI etc. At the same time, there is precious little
awareness evident on the part of those imposing this requirement (at least
where I work) of the enormity of such an expectation. As a consequence, the
resources needed for the language instruction, mentoring, and practice that
would bring these students to the level of proficiency required to produce
such publications are scandalously underestimated...I think.

So I'm wondering if there are any estimates of how much exposure to, say,
English text (reading) it is reasonable to assume that authors who publish
in the sorts of journals these students are expected to publish in
typically have under their belts. My real agenda is that it would help
make a concrete case for devoting more language education resources to these
students (perhaps esp. for reading).

Thanks.
David Wible


On 8/18/10, Ali SH <asaegyn+out at gmail.com> wrote:
>
> Hi all,
>
> Thanks for the responses. I got a number of offline responses, here is a
> collation of all the pointers I received.
> While there doesn't seem to be one conclusive or even complete study,
> piecing together various studies I think gives a pretty good estimate. See
> the second blog post below for a fairly thorough (and sourced!) analysis.
>
> ====
> Brett Reynolds:
>
> All I can offer you is my own back-of-the-envelope calculation here:
>
> <http://english-jack.blogspot.com/2007/07/idioms-interpreting-frequencies.html>
>
> Best,
> Brett
>
> ====
>
> A comment on his post also led to this very interesting blog (with
> relevant information):
> http://learnalanguageortwo.blogspot.com/2009/06/alls-well-in-tv-land.html
>
> ===
>
> There is also the Human Speechome Project
> http://en.wikipedia.org/wiki/Human_Speechome_Project
>
> and lastly the links provided by Marco below.
>
> Cheers,
> Ali
>
> On Tue, Aug 17, 2010 at 12:49 PM, Marco Baroni <marco.baroni at unitn.it> wrote:
>
>> Hi there.
>>
>> I asked a similar question a few years ago, without much success. I paste
>> the summary below.
>>
>> If you find out more, please keep me posted!
>>
>> Regards,
>>
>> Marco
>>
>> Dear all,
>>
>> Two weeks ago I asked if somebody knew of work reporting estimates of how
>> many words/sentences/etc. (adult) speakers of a language hear/write.
>>
>> I paste below the responses I got.
>>
>> Thanks a lot to all who responded!
>>
>> Regards,
>>
>> Marco
>>
>>
>> ******************************************
>> Reinhard Rapp
>> ******************************************
>>
>> Dear Marco,
>>
>> I am also interested in the answer to your question. Some discussion
>> can be found in a Psychological Review paper by Landauer & Dumais
>> (1997) which is on the web at
>>
>> http://lsa.colorado.edu/papers/plato/plato.annote.html
>>
>> This is a citation from the most relevant part, which is footnote 6:
>>
>> ----------- start citation ------------
>>
>> From his log-normal model of word frequency distribution and the
>> observations in Carroll et al. (1971), Carroll estimated a total
>> vocabulary of 609,000 words in the universe of text to which students
>> through high school might be exposed. Dahl (1979), whose distribution
>> function agrees with a different but smaller sample of Howes (1966),
>> found 17,871 word types in 1,058,888 tokens of spoken American English,
>> compared to 50,406 in the comparable sized adult sample of Kucera &
>> Francis (1967). By Carroll's (1971) model, Dahl's data imply a total of
>> roughly 150,000 word types in spoken English, thus approximately
>> one-fourth the total, less to the extent that there are spoken words
>> that do not appear in print. Moreover, the ratio of spoken to printed
>> words to which a particular individual is exposed must be even more
>> lopsided because local, ethnic and family usage undoubtedly restrict the
>> variety of vocabulary more than published works intended for the general
>> school-aged readership.
>>
>> If we assume that our seventh-grader has met a total of 50 million word
>> tokens of spoken English (140 minutes a day at 100 words per minute for
>> 10 years) then the expected number of occasions on which she would have
>> heard a spoken word of mean frequency would be about 370. Carroll's
>> estimate for the total vocabulary of seventh grade texts is 280,000, and
>> we estimate below that the typical student would have read about 3.8
>> million words of print. Thus, the mean number of times she would have
>> seen a printed word to which she might be exposed is only about 14. The
>> rest of the frequency distributions for heard and seen words, while not
>> proportional, would, at every point, show that spoken words have already
>> had much greater opportunity to be learned than printed words, so will
>> profit much less from an additional occurrence.
>>
>> ----------- end citation ------------
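
For anyone who wants to check the arithmetic in the footnote quoted above,
here is a minimal Python sketch. The function names are made up for
illustration, and the 365-day year is an assumption; the footnote does not
say how the 50 million figure was rounded. It covers only the spoken-token
total and the mean exposure per printed word.

```python
def tokens_heard(minutes_per_day, words_per_minute, years, days_per_year=365):
    """Total spoken word tokens heard over a number of years.

    days_per_year=365 is an assumption; the footnote gives no value.
    """
    return minutes_per_day * words_per_minute * days_per_year * years


def mean_exposures(total_tokens, vocabulary_size):
    """Average number of encounters per word type in the vocabulary."""
    return total_tokens / vocabulary_size


# 140 minutes a day at 100 words per minute for 10 years
spoken_tokens = tokens_heard(140, 100, 10)

# 3.8 million words of print over a vocabulary of 280,000 word types
printed_mean = mean_exposures(3_800_000, 280_000)

print(spoken_tokens)           # 51100000, i.e. roughly 50 million
print(round(printed_mean, 1))  # 13.6, i.e. about 14
```
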
>>
>> ...
>>
>> With kind regards,
>>
>> Reinhard
>>
>>
>>
>> ******************************************
>> Paula Newman
>> ******************************************
>>
>> Marco,
>> That's an interesting question.  A little googling suggested that a lower
>> bound might come from data on the average number of hours of TV watching
>> per adult  (multiplied by  average words per minute on TV broadcasts).
>> Paula
>>
>>
>>
>> ******************************************
>> Paul Bennett
>> ******************************************
>>
>>
>> Geoffrey Pullum and Barbara Scholz (in Linguistic Review 19, 2002, p44)
>> cite evidence that by the age of three a child in a professional
>> household might have heard 30 million word tokens (but far fewer for
>> children in other social classes). I know this relates to children
>> rather than adults, but presumably the amount of language heard does not
>> differ much by age.
>>
>> Their source is B. Hart and T. Risley: Meaningful Differences in the
>> Everyday Experiences of Young Children (Paul H Brookes, 1995). I haven't
>> read this, but I guess this would be a place to look for more
>> information.
>>
>> Paul Bennett
>>
>>
>>
>> ******************************************
>> Ilana Bromberg
>> ******************************************
>>
>>
>> Marco,
>>
>> There is some information regarding how much school-age children (up
>> through HS I think) read in the following article.  It's possible that
>> some of the sources they cite may have more information about adults.
>>
>> Landauer, Thomas K and Dumais, Susan T.  1997.  A Solution to Plato's
>> Problem: The Latent Semantic Analysis Theory of the Acquisition,
>> Induction, and Representation of Knowledge.  Psychological Review,
>> 104:2, 211-240.
>>
>> Good luck,
>> Ilana
>>
>>
>> * Then, there was somebody who wanted to remain anonymous, who answered:
>>
>>
>> I was interested in your query to the list, but had nothing scientific to
>> offer. Nevertheless, for many years I have had to make estimates of how much
>> of a person's experience of language is represented by a corpus of
>> such-and-such a size.  It has been necessary to wow the public by suggesting
>> that a query to EDIT scans several years of an individual's language
>> experience, and, on the other hand, to convince sponsors that even half a
>> billion words is just chickenfeed compared with the amount of text produced
>> in a speech community.
>>
>> In EDIT 15 years ago we established a monitor corpus with 100mw of The
>> Times, and discovered that the weekly output of that paper, including The
>> Sunday Times, was over half a million words.  Genuine neologisms, and not
>> just trivial variations or proper names, were coming in at around a dozen
>> every day. But of course not even the most devoted reader gets through
>> anything like the whole paper.
>>
>> Back when I was doing discourse analysis I read somewhere that speech is
>> produced at an average of 1500 clauses an hour, and in speech, by my
>> calculations at the time, a clause seemed to average 5 or 6 words.  I
>> imagine that reading is not very different from that, maybe towards the
>> faster end, but I haven't checked. Then you have to guess how many hours,
>> on average, people are engaged in communicative activity, which I put at
>> 12 hours.  1500 x 6 x 12 gives an estimate of 108,000 words daily,
>> 39,420,000 annually.
>>
>> If you are suspicious about any of the assumptions, you can just change
>> them.
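
The estimate above is easy to re-run with different assumptions. The
following is only a rough Python sketch of that calculation: the function
and parameter names are invented here, the defaults are the figures given
in the message (1500 clauses an hour, 6 words per clause, 12 hours a day),
and the 365-day year is implied by the annual total quoted.

```python
def daily_word_exposure(clauses_per_hour=1500, words_per_clause=6,
                        hours_per_day=12):
    """Words encountered per day under the stated assumptions."""
    return clauses_per_hour * words_per_clause * hours_per_day


def annual_word_exposure(daily_words, days_per_year=365):
    """Scale a daily word count up to a year."""
    return daily_words * days_per_year


daily = daily_word_exposure()
annual = annual_word_exposure(daily)
print(daily)   # 108000
print(annual)  # 39420000

# Changing an assumption: 5 words per clause instead of 6
print(daily_word_exposure(words_per_clause=5))  # 90000
```

Each assumption is a keyword argument, so a more cautious guess (say,
fewer hours of communicative activity per day) only needs one value
changed.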
>>
>>
>
>
> --
> www.reseed.ca
> www.pinkarmy.org
>
> (•`'·.¸(`'·.¸(•)¸.·'´)¸.·'´•) .,.,
>
> _______________________________________________
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
>
>