Corpora: Number of pages on the Internet

AssocProf William H. Fletcher fletcher at usna.edu
Mon Dec 3 16:39:29 UTC 2001


There is really no way to measure the ever-growing number of
pages on the Internet, and every study I have seen comes to a
very different conclusion, depending on its sampling technique
and the extrapolations made from it.  Reliable sources suggest
that the number of pages publicly accessible from links (as
opposed to, say, database queries) is on the order of 2-3 billion
(i.e. 1000 million = 10^9), and that this is perhaps 20% of all
information online.  Of these pages, about 58% are in English
(according to Alex Franz of Google).
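
To make the orders of magnitude concrete, here is a minimal
Python sketch of what the figures above would imply.  It assumes,
purely for illustration, that "20% of all information online" can
be read as a share of page-equivalents, which the sources do not
state; all names in the code are illustrative.

```python
# Rough implications of the estimates quoted above.
# Assumption: "20% of all information online" is treated as a share
# of page-equivalents, which is an illustrative simplification.

LINKED_PAGES_LOW = 2_000_000_000    # low estimate of link-accessible pages
LINKED_PAGES_HIGH = 3_000_000_000   # high estimate of link-accessible pages
LINKED_SHARE_PCT = 20               # linked pages as % of everything online
ENGLISH_SHARE_PCT = 58              # % of linked pages in English


def implied_total(linked_pages: int) -> int:
    """Total 'page-equivalents' online if linked pages are LINKED_SHARE_PCT of it."""
    return linked_pages * 100 // LINKED_SHARE_PCT


def english_pages(linked_pages: int) -> int:
    """Number of linked pages in English at ENGLISH_SHARE_PCT."""
    return linked_pages * ENGLISH_SHARE_PCT // 100


for pages in (LINKED_PAGES_LOW, LINKED_PAGES_HIGH):
    print(f"{pages:,} linked pages -> "
          f"about {implied_total(pages):,} in total, "
          f"about {english_pages(pages):,} in English")
```

On these assumptions the linked Web would hold roughly 1.2 to 1.7
billion English pages, out of something like 10 to 15 billion
page-equivalents overall.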

The last large-scale sampling I know of was done by Inktomi in
Jan 2000.  They counted 1.6 billion pages and showed this
language distribution: English 86.55%, German 5.83%, French
2.36%, Italian 1.55%, Spanish 1.23%, Portuguese 0.85%, Dutch
0.54%, Finnish 0.50%, Swedish 0.36%, Japanese 0.34%.  Since these
ten figures already add up to roughly 100% while excluding most
languages, they obviously do not give the complete picture.
http://www.inktomi.com/webmap/
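
The arithmetic behind that objection can be checked with a short
Python sketch.  The percentages are the ones quoted above; the
variable names are illustrative only.

```python
# Inktomi language shares (percent of pages), as quoted above.
inktomi_shares = {
    "English": 86.55, "German": 5.83, "French": 2.36, "Italian": 1.55,
    "Spanish": 1.23, "Portuguese": 0.85, "Dutch": 0.54, "Finnish": 0.50,
    "Swedish": 0.36, "Japanese": 0.34,
}

# Sum the ten listed languages and see what is left for all others.
total = round(sum(inktomi_shares.values()), 2)
remainder = round(100 - total, 2)

print(f"Sum of the ten listed languages: {total}%")
print(f"Share left for every other language: {remainder}%")
```

The ten listed languages come to slightly more than 100%, so the
figures leave no room at all for any other language.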

I have spent many days looking into this and have always been
disappointed by the inconclusive results. Perhaps the most
interesting trend is revealed in this self-quote, echoed by others:

Historically English-language users and content have overshadowed
other languages on the Internet, but the trend away from the
preponderance of English seems clear. Statistics compiled by
Global Reach illustrate the long-term development. In 1996,
four-fifths of the 50 million Internet users were native speakers
of English. By September 2001 Anglophones constituted only 43% of
the world's online population of 503 million.  Global Reach
expects their share to fall below 30% of the 850 million Web
users projected for 2005. The anticipated phenomenal growth in
this non-Anglophone Web population should spur tremendous
expansion of online resources in tongues other than English,
particularly the smaller non-Western ones, to the benefit of
those who teach, learn, and investigate these languages.  [Global
Reach's current estimates of users by language: English 43%,
Chinese 9.3%, Japanese 9.2%, Spanish 6.7%, German 6.7%, Korean
4.4%, Italian 3.8%, French 3.3%, Portuguese 2.5%, Dutch 2.2%,
Other 8.9%.]
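
A falling share does not mean falling numbers.  The following
minimal Python sketch turns the Global Reach figures quoted above
into absolute user counts.  It assumes the 2005 share is exactly
30%, which is only the upper bound given ("below 30%"); the
function and variable names are illustrative.

```python
# (label, total users in millions, native-English share) from the
# Global Reach figures quoted above. The 2005 share is an upper
# bound ("below 30%"), used here as an illustrative assumption.
snapshots = [
    ("1996", 50, 0.80),
    ("Sept 2001", 503, 0.43),
    ("2005 (projected)", 850, 0.30),
]


def split_users(total_millions: float, english_share: float) -> tuple[float, float]:
    """Return (English, non-English) user counts in millions, rounded to 0.1."""
    english = total_millions * english_share
    return round(english, 1), round(total_millions - english, 1)


for label, total_millions, share in snapshots:
    english, other = split_users(total_millions, share)
    print(f"{label}: {english} million English, {other} million other")
```

On these figures the number of native English speakers online
grows from about 40 million to about 216 million, and to at most
255 million in the projection, while the non-English population
grows from about 10 million to about 287 million and then to at
least 595 million.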

If anyone has fresher reliable estimates, I'd love to hear about
them.

Regards,
Bill Fletcher

Here are some sources, many based on / derived from each other.

Excellent but dated study (explains how to sample and estimate--
excellent background information):
Lawrence, S. & C. L. Giles. (1999). Accessibility of Information
on the Web. Nature, 400: 107-109. Summary, commentary, update and
download at http://www.wwwmetrics.com

This study concludes 85% of information was from USA; authors do
not plan to update the study:
Moore, A. & Murray, B.H. (2000). Sizing the Internet. July 10,
2000. Arlington, VA: Cyveillance, Inc. Retrieved 8 October 2000
from the World Wide Web:
http://www.cyveillance.com/resources/7921S_Sizing_the_Internet.pdf

Agence de la Francophonie's "L5 The Fifth Study on Languages and
the Internet" http://funredes.org/LC/english/L5/L5overview.html
studies the presence on the Internet of English, German, and the
Romance languages.

Study of number of USERS per language; methodologically sound:
Global Internet Statistics (by Language). San Francisco, CA:
Global Reach. Retrieved 6 October 2001 from the World Wide Web:
http://www.glreach.com/globstats/index.php3

Comparable figures:
Nua Internet How Many Online. Dublin: Nua Ltd. Retrieved 8
October 2001 from the World Wide Web:
http://www.nua.ie/surveys/how_many_online/index.html and regional
subpages.

Lots of information gleaned from various sources:
Estadísticas de Internet en el ámbito internacional. Madrid:
Asociación de Usuarios de Internet. Retrieved 6 November 2001
from the World Wide Web:
http://www.aui.es/estadi/internacional/internacional.htm

Interesting methodology -- tries to estimate number of WORDS, not
PAGES per language, but restricted to select Western European
languages:
Grefenstette, Gregory & Julien Nioche. (2000). Estimation of
English and non-English Language Use on the WWW. RIAO 2000,
Paris, 12-14 April 2000.  Retrieved 12 October 2001 from the
World Wide Web:
http://www.xrce.xerox.com/research/mltt/publications/Documents/P19137/content/RIAO2000gref.pdf

------------------------------------------------------------
Further quotes from my paper
Concordancing the Web with KWiCFinder, William H. Fletcher,
United States Naval Academy

Submitted for publication in proceedings of
North American Association for Applied Corpus Linguistics
Third North American Symposium on Corpus Linguistics and Language
Teaching, Boston, MA, 23-25 March 2001


The World Wide Web is a wondrous place, with an overwhelming
variety of information in countless languages and domains. Just
how many webpages there are and how they are distributed by
language and content are not easy questions to answer. The Web is
constantly growing and changing, and even the best estimates can
only approximate its extent and composition.  Studies of the
nature of the Web echo the story of the blind men and the
elephant:  each extrapolates from different samples of an
ever-evolving entity taken at different times and by divergent
means.  The most reliable estimates suggest that the number of
publicly-indexable webpages in mid-2001 falls in the range of two
to five billion (i.e. thousand million = 10^9), a number projected
to grow to 10-15 billion by mid-decade.

These two billion-plus pages are only the visible tip of the
iceberg. For a page to be indexable, there must be a valid link
to it from another publicly accessible site, which excludes the
many pages with restricted access. Far larger is the vast
"invisible web" of content in databases, which can only be
retrieved by entering relevant queries in a text box, and of text
materials stored in formats which are not typically indexed, such
as word processor, PostScript and Adobe Acrobat files.

Despite the overall size of this corpus, one language, English,
continues to predominate. Studies conducted in 2000 by Inktomi
and Cyveillance conclude that over 85% of publicly-accessible
webpages are in English, but here again even the best-informed
estimates vary widely.  In the summer of 2001 the Agence de la
Francophonie released L5: the Fifth Study of Language and the
Internet, based on these studies and the one by Global Reach
cited below, complemented by research into the numbers of
webpages in various languages returned by search engines. This
report investigates the relative presence of the Romance
languages, German, and English among online documents.  It shows
strong growth among the non-English languages in the proportion
of webpages found relative to English, concluding that the number
of webpages in each is roughly proportional to the number of Web
users with that language as native tongue. Data from these and
other studies of linguistic diversity on the Web are summarized
in this note.



- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

  William H. Fletcher                         (410) 293-6362 [voice]
  Associate Professor of German and Spanish   (410) 293-2729 [fax]
  Language Studies Department
  US Naval Academy
  589 McNair Road
  Annapolis, MD 21402 - 5030

  fletcher at usna.edu
  http://www.usna.edu/LangStudy/

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -


