[Corpora-List] Re: problems with Google counts

Stefan Evert evert at IMS.Uni-Stuttgart.DE
Thu Mar 17 09:01:02 UTC 2005


> > (I had deleted all those messages after reading them).  I tried this
> > yesterday on a couple of Spanish words (eficaz, eficiente).  (By the
> > way, the results were apparently consonant with a student's search of
> > the 100,000,000 word corpusdelespañol site.)  Anyway, what repeating
> > the word apparently does is limit the results to those sites which
> > have the word at least two times, in this case cutting down on the
> > numbers by roughly 10%.
>
> Actually that's not the case. When you repeat the word, Google ranks
> first pages that contain the multiword expression you type. For example,
> if you type A B C, you'll see first pages that contain "A B C" exactly,
> if any. In the case of A A, you will see pages that contain exactly "A
> A" first, but pages where A appears only once appear later on.

Well, that can't quite be the case either, at least not today. Things
get really funny (in the "weird" sense, I'm afraid) when you start
looking for more than two repetitions. These are the numbers I got
from Google 5 minutes ago:

3,560,000,000   the
3,600,000,000   the the
2,800,000,000   the the the
2,830,000,000   the the the the
2,820,000,000   the the the the the
etc.

When you look for non-stop-words, Google seems to make a distinction
between one occurrence and two or more occurrences:

3,110,000   fink
1,970,000   fink fink
1,970,000   fink fink fink
etc.

It would seem that in response to Jean's post, Google has changed
something to enforce consistent results (unless this is just a
side-effect of a new search engine that doesn't support wildcards).

If you go to the German Google site (www.google.de), for instance, you
will still find the old search engine in place (funny that google.de
seems to find more English pages than google.com ...):

8,000,000,000   the
   88,100,000   the the
   87,500,000   the the the
   86,700,000   the the the the
etc.

At least we still have the wildcard "*" for an arbitrary word. For
non-stop-words, the results are consistently inconsistent:

3,460,000   fink
1,900,000   fink fink
1,920,000   fink fink fink
1,870,000   fink fink fink fink
1,910,000   fink fink fink fink fink

I am quite convinced that there is no sensible interpretation of these
queries for which the Google numbers are even remotely plausible.
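
One way to make this concrete is a small consistency check. The sketch below assumes the interpretation suggested earlier in the thread, generalised from two repetitions to n: a query with n copies of a word matches pages containing the word at least n times. Under that assumption the counts can never increase as n grows. The counts are copied from the tables in this message; the function name `violations` and the dictionary labels are made up for illustration.

```python
# Counts copied from the tables above; position i holds the count
# reported for i+1 repetitions of the query word.
counts = {
    "google.com / the": [3_560_000_000, 3_600_000_000, 2_800_000_000,
                         2_830_000_000, 2_820_000_000],
    "google.com / fink": [3_110_000, 1_970_000, 1_970_000],
    "google.de / the": [8_000_000_000, 88_100_000, 87_500_000, 86_700_000],
    "google.de / fink": [3_460_000, 1_900_000, 1_920_000,
                         1_870_000, 1_910_000],
}


def violations(series):
    """Return (n, n+1) repetition pairs where the reported count goes up.

    ASSUMPTION: n repetitions means "at least n occurrences", so a
    correct series must be non-increasing.
    """
    return [(i + 1, i + 2)
            for i in range(len(series) - 1)
            if series[i + 1] > series[i]]


for label, series in counts.items():
    bad = violations(series)
    status = "non-increasing" if not bad else f"count rises at {bad}"
    print(f"{label:18s} {status}")
```

Passing this check only rules out one kind of inconsistency. A series can be non-increasing and still be hard to believe, as with the drop from 8,000,000,000 to 88,100,000 for "the" on google.de.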

Stefan.
http://wacky.sslmit.unibo.it/


--
I'm not a nerd. I'm a specialist.
                                   -- from Full Metal Panic, Episode 8
______________________________________________________________________
Stefan Evert                                     purl.org/stefan.evert
http://www.collocations.de/                        stefan.evert at uos.de


