Corpora: Number of pages on the Internet

Patrick Corliss patrick at quad.net.au
Mon Dec 3 18:36:01 UTC 2001


On Mon, 3 Dec 2001 14:48:26 +0000 (GMT), Hristo Tanev wrote:

> The question is: approximately how many pages in English
> exist on the Internet?

I see that you have received a good reply from Assoc. Professor William H.
Fletcher.  I would make particular mention of the so-called "deep web".  I
don't know the English-language percentage, but see the BrightPlanet
description and URL below:

"BrightPlanet's search technology automates the process of making dozens of
direct queries simultaneously using multiple-thread technology and thus is the
only search technology, so far, that is capable of identifying, retrieving,
qualifying, classifying, and organizing both "deep" and "surface" content."

http://www.brightplanet.com/deepcontent/tutorials/DeepWeb/index.asp
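
To make that mechanism concrete, here is a minimal sketch, in Python, of the
kind of multi-threaded direct querying BrightPlanet describes.  It is not
BrightPlanet's code; the endpoint URLs and the "q" parameter name are
hypothetical placeholders, and a real federated search would speak each
database's own query interface.

# Minimal sketch: fire direct queries at several searchable databases at once
# using a thread pool, then collect whatever each one returns.  All endpoint
# URLs below are hypothetical placeholders.
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlencode
from urllib.request import urlopen

ENDPOINTS = [
    "https://law-db.example/search",
    "https://science-db.example/search",
    "https://image-db.example/search",
]

def direct_query(endpoint, term, timeout=10):
    """Send one direct request to a searchable database; return its raw reply."""
    url = endpoint + "?" + urlencode({"q": term})
    with urlopen(url, timeout=timeout) as resp:
        return endpoint, resp.read()

def federated_search(term):
    """Query every endpoint simultaneously and keep the answers that arrive."""
    with ThreadPoolExecutor(max_workers=len(ENDPOINTS)) as pool:
        futures = [pool.submit(direct_query, ep, term) for ep in ENDPOINTS]
        results = []
        for future in futures:
            try:
                results.append(future.result())
            except OSError:
                pass  # an unreachable database simply drops out of the result set
        return results

if __name__ == "__main__":
    for endpoint, reply in federated_search("corpus linguistics"):
        print(endpoint, len(reply), "bytes")

The point is only that every result comes back from a direct request answered
dynamically by a database, not from a pre-built index of static pages.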

One of the pages, "Deep Web Sites", indicates that the 60 largest known deep
Web sites contain about 750 terabytes of data (on an "HTML-included" basis), or
roughly 40 times the size of the known surface Web.  These sites span a broad
array of domains, from science to law to images and commerce.  The total number
of records or documents within this group is about 85 billion.

Basically, the folks at BrightPlanet found that "Deep Web sources store their
content in searchable databases that only produce results dynamically in
response to a direct request."  Ordinary "spider" indexing of "surface" web
sites misses this content (a toy sketch after the list below illustrates why),
and BrightPlanet says what is missed is truly vast:

*    Public information on the deep Web is currently 400 to 550 times larger
than the commonly defined World Wide Web.
*    The deep Web contains 7,500 terabytes of information compared to nineteen
terabytes of information in the surface Web.
*    The deep Web contains nearly 550 billion individual documents compared to
the one billion of the surface Web.
*    More than 200,000 deep Web sites presently exist.
*    Sixty of the largest deep-Web sites collectively contain about 750
terabytes of information -- sufficient by themselves to exceed the size of the
surface Web forty times.
*    On average, deep Web sites receive fifty per cent greater monthly traffic
than surface sites and are more highly linked to than surface sites; however,
the typical (median) deep Web site is not well known to the Internet-searching
public.
*    The deep Web is the largest growing category of new information on the
Internet.
*    Deep Web sites tend to be narrower, with deeper content, than
conventional surface sites.
*    Total quality content of the deep Web is 1,000 to 2,000 times greater
than that of the surface Web.
*    Deep Web content is highly relevant to every information need, market,
and domain.
*    More than half of the deep Web content resides in topic-specific
databases.
*    A full ninety-five per cent of the deep Web is publicly accessible
information -- not subject to fees or subscriptions.
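
As a toy illustration of that spider-versus-direct-request distinction
(hypothetical URLs throughout, and certainly not how any particular engine is
implemented): a link-following crawler only harvests the hrefs embedded in
static pages, so a record that exists solely as the dynamic answer to a query
never enters its index.

# Toy contrast between surface crawling and a deep-Web direct request.
# Every URL here is a hypothetical placeholder.
import re
from urllib.parse import urlencode, urljoin
from urllib.request import urlopen

def spider_links(start_url):
    """Surface-style crawling: fetch a static page and harvest its hyperlinks."""
    html = urlopen(start_url, timeout=10).read().decode("utf-8", "replace")
    return [urljoin(start_url, href) for href in re.findall(r'href="([^"]+)"', html)]

def direct_request(search_url, term):
    """Deep-style access: this content exists only as the answer to a query."""
    return urlopen(search_url + "?" + urlencode({"q": term}), timeout=10).read()

if __name__ == "__main__":
    # The crawler never composes "?q=..." requests on its own, so everything
    # behind the search form stays invisible to it.
    print(spider_links("https://archive.example/index.html"))
    print(len(direct_request("https://archive.example/search", "deep web")),
          "bytes of dynamically generated content")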

To put these findings in perspective, a study at the NEC Research Institute
(1), published in Nature, estimated that the search engines with the largest
number of Web pages indexed (such as Google or Northern Light) each index no
more than sixteen per cent of the surface Web.

With thanks to Tony Barry of the Australian [LINK] mailing list for drawing it
to my attention with the posting below.  Thanks also to Jan Whitaker of
JLWhitaker Associates, Melbourne, Victoria, Australia <jwhit at primenet.com>
for most of the above expansion (which includes her commentary).
http://www.primenet.com/~jwhit/whitentr.htm

On Sat, 20 Jan 2001 14:10:50 +1100, Tony Barry <me at Tony-Barry.emu.id.au>
wrote to: <link at www.anu.edu.au>
Subject: [LINK] Deep web

> Extracted item for information.
>
> Source: THE NET NEWS
> From Alan Farrelly
> January 20, 2001
>
> - - - - -
> DEEPEST WEB
> The Deep Web, "hidden" under the surface Web, is much bigger than originally
> thought. The Deep Web consists of those searchable databases that only produce
> results dynamically in response to a direct request.

> Ordinary indexing of surface sites misses this vast content.  Public
> information on the deep Web is currently 500 times larger than the commonly
> defined World Wide Web, with 7,500 terabytes of data, compared to 20
> terabytes on the surface Web. That's 550 billion individual documents - while
> Google today offers a search of just 1,326,920,000 web pages. More at
> http://www.completeplanet.com/tutorials/deepweb/index.asp
>
> DEEP NET NEWS!
> Net News has done its bit for the Deep Web. Four of those terabytes are in
> the huge newspaper text and picture databases we've built over the last year -
> searchable text at http://www.newstext.com.au and viewable pictures at
> http://www.newsphotos.com.au and http://www.newspix.com.au - tens of
> millions of articles and photos available to anyone.
>
> GREY LADY EXPANDS
> And the Deep Net gets deeper. The New York Times is expanding its archives
> to include digital images of every page published from 1851 to 1998. The 3.5
> million pages are being digitised as part of a licensing deal with Bell and
> Howell:
> http://biz.yahoo.com/prnews/010112/dc_bell_ho.html
> --
> phone  +61 2 6241 7659
> mailto:me at Tony-Barry.emu.id.au
> http://purl.oclc.org/NET/Tony.Barry

Best regards
Patrick Corliss


