[Corpora-List] problems with Google counts

Wed Mar 16 18:26:48 UTC 2005

A few years ago I did a study of the uses of the definite article THE in
English using Google search (the data were collected in 2003).  I used
an Internet search engine to conduct the study partly because I wanted
to get page counts, which exclude repeated instances in the same
text (i.e., rather than absolute frequencies).
I gathered about 1500 nouns and entered each of them into the search
engine using two strings, "the * N" and "the N".  I also did the same for
other pre-nominal elements such as "a", "this", "that", "my", "his", "her".
Other criteria I used at that time were "in text only" and "English only".
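
As a rough illustration of how such query strings could be generated in
bulk, here is a minimal sketch.  The function name and the sample noun are
hypothetical and are not part of the original study; only the two string
patterns and the list of pre-nominal elements come from the description
above.

```python
from itertools import product

# Pre-nominal elements named in the text above.
DETERMINERS = ["the", "a", "this", "that", "my", "his", "her"]


def build_queries(nouns, determiners=DETERMINERS):
    """Map each (determiner, noun) pair to its two quoted query strings.

    The first string is the adjacent pattern ("the N"), the second the
    wildcard pattern ("the * N").  Hypothetical helper, for illustration.
    """
    queries = {}
    for det, noun in product(determiners, nouns):
        adjacent = f'"{det} {noun}"'
        wildcard = f'"{det} * {noun}"'
        queries[(det, noun)] = (adjacent, wildcard)
    return queries


if __name__ == "__main__":
    for pair, strings in build_queries(["book"]).items():
        print(pair, strings)
```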

The inconsistency I found, at that time, was that the sum of the
frequencies I obtained for all the nouns with one element was always much
larger than the frequency reported in a single search for that element,
i.e., the sum of all "the N" counts was much larger than the count for
the word "the" alone in the Google database, which did puzzle me.
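
The comparison described above can be sketched as follows.  The function
name is hypothetical and any numbers passed in are placeholders, not data
from the study.  One caveat: with page counts, a page that contains
several different "the N" phrases is counted once for each phrase, so the
sum may exceed the count for "the" without any error in the engine.  Only
a single phrase whose count exceeds the count for the element alone is
strictly impossible, so the sketch reports both.

```python
def check_counts(element_count, phrase_counts):
    """Compare summed phrase page counts with the count for the element alone.

    element_count: reported page count for the bare element (e.g. "the").
    phrase_counts: dict mapping each phrase (e.g. "the book") to its count.
    """
    total = sum(phrase_counts.values())
    # A phrase cannot occur on more pages than the element it contains.
    impossible = sorted(
        phrase for phrase, count in phrase_counts.items()
        if count > element_count
    )
    ratio = total / element_count if element_count else None
    return {"sum": total, "ratio": ratio, "impossible": impossible}


if __name__ == "__main__":
    # Made-up numbers, for illustration only.
    print(check_counts(100, {"the book": 60, "the dog": 70, "the cat": 120}))
```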

On the other hand, I did find some consistencies in the data.  First,
the ratios of the frequencies across searches were always about the
same, even though I repeated all the searches a couple of times over
several months.  In addition, the relative frequencies among the nouns
at that time, as far as I could check, were consistent with the data I
found in some other corpora (e.g., if one finds that a word has a
relatively high frequency in Google, one would also find that word
having a relatively high frequency in other texts).
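
One way to quantify this kind of agreement in relative frequencies is a
rank correlation between the Google counts and the counts from another
corpus.  The sketch below uses Spearman's coefficient, which is my choice
for illustration and not necessarily the comparison made in the study;
the function names and sample numbers are likewise hypothetical.

```python
def average_ranks(values):
    """Return 1-based ranks of values, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the two rank lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long lists with at least 2 items")
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when all values are tied")
    return cov / (var_x * var_y) ** 0.5


if __name__ == "__main__":
    google_counts = [3000, 200, 10]   # made-up page counts for three nouns
    corpus_counts = [9, 5, 1]         # made-up counts from another corpus
    print(spearman(google_counts, corpus_counts))
```

A value near 1 would mean the two sources rank the nouns in nearly the
same order, even if the absolute counts differ widely.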

I agree that using Google to conduct linguistic studies has gotten more
and more difficult since then, as the design of the search engine has
been changing for commercial reasons.  We do need a search engine
designed specifically for linguistic studies.  On the other hand, until
such a search engine is available, other ways to avoid problematic
results might be to adjust the design of the study according to the
known weaknesses of the engine and to cross-check the results manually
with traditional corpora and other search engines.

--
==============================
Ring Low
mlow at acsu.buffalo.edu
http://www.acsu.buffalo.edu/~mlow/
==============================

Lillian Lee wrote:

>Dear list members,
>
>You might be interested to know that until approximately March 8th,
>Google counts appear to have been quite off (inflation rates of a
>factor of 66%?), according to Jean Veronis.
>
>In a blog post of February 8th
>( http://aixtal.blogspot.com/2005/02/web-googles-missing-pages-mystery.html ),
>Veronis summarized his earlier findings:
>
>  # If you type Chirac OR Sarkozy, you get half the number results of
>    Chirac alone, which may have a political explanation... but is a
>    weird approach to boolean logic.
>
>  # If you search the in the English pages, you get 1% of the number
>    you get for the all languages together. Does this mean that the is
>    99 times more frequent in languages other than English? Of course
>    not.
>
>He gave a possible explanation and noted that "if you want to know the
>real index count for any word, simply type it twice".
>
>On March 13th, he noted that the counts seem to have been adjusted,
>that is "changed in a major way":
>http://aixtal.blogspot.com/2005/03/web-google-adjusts-its-counts.html
>
>Related posts indicate problems with MSN, the possibility that Yahoo
>indexes more pages than Google, and more details on calculations.
>
>________________________________________________________________
>Lillian Lee, Assoc. Prof.    tel: 607-255-8119
>Dept of Computer Science     fax: 607-255-4428
>Cornell University           llee at cs.cornell.edu
>Ithaca, NY 14853-7501 USA    www.cs.cornell.edu/home/llee
>________________________________________________________________
>