[Corpora-List] problems with Google counts

Adam Kilgarriff adam at lexmasterclass.com
Mon Mar 14 16:32:36 UTC 2005


Problem and solution are both simple (intellectually, if not
technically):

Problem:
	Google's goal is to keep its customers happy. We (the NLP/web
research community) are not a significant proportion of its customers,
and we are the only people who care about the accuracy of counts.

Solution:
	don't use Google to get web counts: set up and use a search
engine with a scientific, not a commercial, mission instead.

This is my current research agenda (see e.g.
http://www.lexmasterclass.com/people/Publications/2003-K-LSEsprolac.pdf);
see also http://wacky.sslmit.unibo.it/

Adam Kilgarriff

-----Original Message-----
From: owner-corpora at lists.uib.no [mailto:owner-corpora at lists.uib.no] On
Behalf Of Lillian Lee
Sent: 14 March 2005 15:47
To: CORPORA at uib.no
Subject: [Corpora-List] problems with Google counts


Dear list members,

You might be interested to know that until approximately March 8th,
Google counts appear to have been quite off (inflation on the order
of 66%?), according to Jean Veronis.

In a blog post of February 8th
(http://aixtal.blogspot.com/2005/02/web-googles-missing-pages-mystery.html),
Veronis summarized his earlier findings:

  # If you type Chirac OR Sarkozy, you get half the number of results
    for Chirac alone, which may have a political explanation... but is
    a weird approach to boolean logic.

  # If you search "the" in the English pages, you get 1% of the number
    you get for all languages together. Does this mean that "the" is
    99 times more frequent in languages other than English? Of course
    not.
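
The two observations above amount to sanity checks that any set of
reported hit counts should pass. A minimal sketch in Python, assuming
the counts have already been collected by hand; the function names and
the numbers are made up for illustration and are not taken from
Veronis's posts or from any search engine API:

```python
def or_count_is_possible(count_a, count_b, count_a_or_b):
    """Check a reported count for 'A OR B' against set logic.

    The union of two result sets can be no smaller than the larger set
    and no larger than the two sets added together.
    """
    lower = max(count_a, count_b)
    upper = count_a + count_b
    return lower <= count_a_or_b <= upper


def subset_count_is_possible(count_subset, count_all):
    """Check a restricted count (e.g. English pages only) against the
    unrestricted one: a subset cannot have more hits than the whole."""
    return 0 <= count_subset <= count_all


# Hypothetical counts, chosen only to mirror the shape of the problem.
chirac = 3_000_000
sarkozy = 1_000_000

# An OR count that is half the count for one term alone fails the check.
print(or_count_is_possible(chirac, sarkozy, chirac // 2))

# An OR count between the larger term and the sum passes.
print(or_count_is_possible(chirac, sarkozy, 3_500_000))
```

Note that the second observation (1% of hits in English pages) would
pass subset_count_is_possible: it is not logically impossible, only
linguistically implausible, so it needs outside knowledge to flag.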

He gave a possible explanation and noted that "if you want to know the
real index count for any word, simply type it twice".

On March 13th, he noted that the counts seem to have been adjusted,
that is "changed in a major way":
http://aixtal.blogspot.com/2005/03/web-google-adjusts-its-counts.html

Related posts indicate problems with MSN, the possibility that Yahoo
indexes more pages than Google, and more details on calculations.

________________________________________________________________
Lillian Lee, Assoc. Prof.    tel: 607-255-8119
Dept of Computer Science     fax: 607-255-4428
Cornell University           llee at cs.cornell.edu
Ithaca, NY 14853-7501 USA    www.cs.cornell.edu/home/llee
________________________________________________________________


