[Corpora-List] amount of text on the web?

Tue Nov 13 19:05:44 UTC 2007

> 
> Drago,
> 
> If we are talking about text, isn't it better to count in words than
> bytes?
> (How do you count texts in scanned images?  We don't really want to say
> that 500 words of a scanned image count for 1000 times as much as 500
> words of ASCII.)

Words is fine.

> 
> Then, we can refer to Google basing Web1T on 10^12 words of English.  Of
> course that is only what Google finds, not what is there, and it is only
> English.  But they will have taken tasks like distinguishing text from
> non-text, and deduplication, seriously, which must be a good thing if the
> question is asked from a linguistic or NLP perspective.

According to
http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html
the Google corpus was indeed based on processing 1 (US) trillion words
(10^12 words); however, there is no indication that this represents all
the textual data that Google has indexed, and I doubt that it does.

I was invited to edit a special issue of IEEE Intelligent Systems on
"NLP using and for the Web" (title to be finalized) and I realized
that we (or at least I) don't even know accurately how much text is on
the Web. Adam, you were one of the earliest proponents of using the
Web as a corpus. Do you know which is the largest corpus study (in
terms of the size of the underlying data set) ever done in NLP?

Drago

> 
> While the Berkeley reference is clearly a key one, I was surprised simply
> at the extent to which it showed up more questions than answers.  If
> that's the best guess (at least in 2003) at how much is out there, our
> collective level of ignorance really is stunning.  (Though I can't help
> thinking that the big guys - Google, Yahoo, Microsoft, IBM - will have
> better answers that they don't publish.)
> 
> Adam
> 
> 
> -----Original Message-----
> From: corpora-bounces at uib.no [mailto:corpora-bounces at uib.no] On Behalf Of
> radev at umich.edu
> Sent: 13 November 2007 17:37
> To: Constantin Orasan
> Cc: corpora at hd.uib.no
> Subject: Re: [Corpora-List] amount of text on the web?
> 
> This is too old. I have seen this one and quoted it a lot.
> 
> >
> > Hi,
> >
> > The numbers are a bit old but a very good study which investigates how
> > much data is on the web is:
> >
> > Lyman, Peter and Hal R. Varian (2003)  How much information – 2003.
> > Technical report, School of Information Management and Systems,
> > University of California at Berkeley.
> >
> > http://www2.sims.berkeley.edu/research/projects/how-much-info-2003/
> >
> > Regards
> >
> > Constantin
> >
> > > I am looking for some up to date statistics on the amount of textual
> > > data on the web. I have seen varied estimates ranging up to 1
> > > Exabyte. I am sure that it is not possible to define precisely what
> > > "text on the web" means (do you include email, cached text, local
> > > files, "hidden" web, etc).
> > >
> > > Drago
> >
> > --
> > Constantin Orasan <C.Orasan at wlv.ac.uk>
> > Lecturer in Computational Linguistics
> > Research Group in Computational Linguistics
> > http://www.wlv.ac.uk/~in6093/
> > University of Wolverhampton
> >
> >
> 
> 
> --
> Dragomir R. Radev                    Associate Professor
> SI, CSE, Ling                     U. Michigan, Ann Arbor
> http://www.eecs.umich.edu/~radev         radev at umich.edu
> 
> _______________________________________________
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora