[Corpora-List] amount of text on the web?
Adam Kilgarriff
adam at lexmasterclass.com
Tue Nov 13 18:52:28 UTC 2007
Drago,
If we are talking about text, isn't it better to count in words than in bytes?
(How do you count texts in scanned images? We don't really want to say that
500 words of a scanned image count for 1000 times as much as 500 words of
ASCII.)
Then, we can refer to Google basing Web1T on 10^12 words of English. Of
course that is only what Google finds, not what is there, and it is only
English. But they will have taken tasks like distinguishing text from
non-text, and deduplication, seriously, which must be a good thing if the
question is asked from a linguistic or NLP perspective.
While the Berkeley reference is clearly a key one, I was surprised simply at
the extent to which it showed up more questions than answers. If that's the
best guess (at least in 2003) at how much is out there, our collective level
of ignorance really is stunning. (Though I can't help thinking that the big
guys - Google, Yahoo, Microsoft, IBM - will have better answers that they
don't publish.)
Adam
-----Original Message-----
From: corpora-bounces at uib.no [mailto:corpora-bounces at uib.no] On Behalf Of
radev at umich.edu
Sent: 13 November 2007 17:37
To: Constantin Orasan
Cc: corpora at hd.uib.no
Subject: Re: [Corpora-List] amount of text on the web?
This is too old. I have seen this one and quoted it a lot.
>
> Hi,
>
> The numbers are a bit old but a very good study which investigates how
> much data is on the web is:
>
> Lyman, Peter and Hal R. Varian (2003) How much information – 2003.
> Technical report, School of Information Management and Systems,
> University of California at Berkeley.
>
> http://www2.sims.berkeley.edu/research/projects/how-much-info-2003/
>
> Regards
>
> Constantin
>
> > I am looking for some up to date statistics on the amount of textual
> > data on the web. I have seen varied estimates ranging up to 1
> > Exabyte. I am sure that it is not possible to define precisely what
> > "text on the web" means (do you include email, cached text, local
> > files, "hidden" web, etc).
> >
> > Drago
>
> --
> Constantin Orasan <C.Orasan at wlv.ac.uk>
> Lecturer in Computational Linguistics
> Research Group in Computational Linguistics
> http://www.wlv.ac.uk/~in6093/
> University of Wolverhampton
>
>
--
Dragomir R. Radev Associate Professor
SI, CSE, Ling U. Michigan, Ann Arbor
http://www.eecs.umich.edu/~radev radev at umich.edu
_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora