[Corpora-List] amount of text on the web?

Wed Nov 14 13:52:46 UTC 2007

Go to Berkeley (www.sims.berkeley.edu) and see the project "How Much Information?". Search for 'information statistics' to get estimates of knowledge stocks and information flows for various media.

  István.

radev at umich.edu wrote:
  > 
> Drago,
> 
> If we are talking about text, isn't it better to count in words than
> bytes?
> (How do you count texts in scanned images? We don't really want to say
> that 500 words of a scanned image count for 1000 times as much as 500 words
> of ASCII.)

Words is fine.

> 
> Then, we can refer to Google basing Web1T on 10^12 words of English. Of
> course that is only what Google finds, not what is there, and it is only
> English. But they will have taken tasks like distinguishing text from
> non-text, and deduplication, seriously, which must be a good thing if the
> question is asked from a linguistic or NLP perspective.

According to
http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html
the Google corpus was indeed based on processing 1 (US) trillion words
(10^12 words); however, there is no indication that this represents all
the textual data that Google has indexed. I doubt that this is the case.

I was invited to edit a special issue of IEEE Intelligent Systems on
"NLP using and for the Web" (title to be finalized) and I realized
that we (or at least I) don't even know accurately how much text is on
the Web. Adam, you were one of the earliest proponents of using the
Web as a corpus. Do you know what is the largest corpus study (in
terms of the size of the underlying data set) ever done in NLP?

Drago

> 
> While the Berkeley reference is clearly a key one, I was surprised simply at
> the extent to which it showed up more questions than answers. If that's the
> best guess (at least in 2003) at how much is out there, our collective level
> of ignorance really is stunning. (Though I can't help thinking that the big
> guys - Google, Yahoo, Microsoft, IBM - will have better answers that they
> don't publish)
> 
> Adam
> 
> 
> -----Original Message-----
> From: corpora-bounces at uib.no [mailto:corpora-bounces at uib.no] On Behalf Of radev at umich.edu
> Sent: 13 November 2007 17:37
> To: Constantin Orasan
> Cc: corpora at hd.uib.no
> Subject: Re: [Corpora-List] amount of text on the web?
> 
> This is too old. I have seen this one and quoted it a lot.
> 
> >
> > Hi,
> >
> > The numbers are a bit old but a very good study which investigates how
> > much data is on the web is:
> >
> > Lyman, Peter and Hal R. Varian (2003) How much information – 2003.
> > Technical report, School of Information Management and Systems,
> > University of California at Berkeley.
> >
> > http://www2.sims.berkeley.edu/research/projects/how-much-info-2003/
> >
> > Regards
> >
> > Constantin
> >
> > > I am looking for some up to date statistics on the amount of textual
> > > data on the web. I have seen varied estimates ranging up to 1
> > > Exabyte. I am sure that it is not possible to define precisely what
> > > "text on the web" means (do you include email, cached text, local
> > > files, "hidden" web, etc).
> > >
> > > Drago
> >
> > --
> > Constantin Orasan
> > Lecturer in Computational Linguistics
> > Research Group in Computational Linguistics
> > http://www.wlv.ac.uk/~in6093/
> > University of Wolverhampton
> >
> >
> 
> 
> --
> Dragomir R. Radev Associate Professor
> SI, CSE, Ling U. Michigan, Ann Arbor
> http://www.eecs.umich.edu/~radev radev at umich.edu
> 
> _______________________________________________
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
> 
> 
> 

-- 
Dragomir R. Radev Associate Professor
SI, CSE, Ling U. Michigan, Ann Arbor 
http://www.eecs.umich.edu/~radev radev at umich.edu 

_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora
