[Corpora-List] problems with Google counts

Jean Veronis Jean.Veronis at up.univ-mrs.fr
Mon Mar 14 19:24:23 UTC 2005


Thanks, Lillian, for citing this study (a series of studies, indeed,
since the saga continues).

I think it is very important that we, as linguists, analyse very
closely what search engines offer us, if we are (as more and more of us
are tempted) going to do "Google linguistics". My conclusion,
unfortunately, is that Google counts are totally unreliable. By
unreliable I do not mean just a few percent of uncertainty, as you can
see in my posts. MSN seems to cheat us as well:

http://aixtal.blogspot.com/2005/02/web-msn-cheating-too.html

Yahoo delivers more credible results, and so far I have been able to
use it satisfactorily. Unfortunately, last week I found that, all of a
sudden, they had exactly doubled their index size (without announcing
it officially). So far, so good, but if you look at the figures, you'll
see that the correlation between the previous counts and the new ones is
so high (R2 > 0.99) that it is very difficult to accept that the
doubling is due to natural growth:

http://aixtal.blogspot.com/2005/03/web-yahoo-double-ses-comptes.html
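
A minimal sketch of the kind of check described above, assuming you have
recorded hit counts for the same queries before and after the change.
The counts below are invented purely for illustration (they are not the
figures from the blog post), and the function names are hypothetical. A
slope very close to 2 together with an R2 very close to 1 would mean
that every count was scaled by nearly the same factor, which is not what
uneven, organic index growth would be expected to produce.

```python
def slope_through_origin(old, new):
    """Least-squares slope k for the model new = k * old."""
    return sum(o * n for o, n in zip(old, new)) / sum(o * o for o in old)


def r_squared(old, new):
    """Squared Pearson correlation between the two lists of counts."""
    size = len(old)
    mean_old = sum(old) / size
    mean_new = sum(new) / size
    cov = sum((o - mean_old) * (n - mean_new) for o, n in zip(old, new))
    var_old = sum((o - mean_old) ** 2 for o in old)
    var_new = sum((n - mean_new) ** 2 for n in new)
    return cov * cov / (var_old * var_new)


# Hypothetical counts for five queries, before and after the change.
# The "after" values are constructed as an exact doubling on purpose.
old_counts = [1200, 45000, 310000, 2700000, 15800000]
new_counts = [2 * c for c in old_counts]

slope = slope_through_origin(old_counts, new_counts)
r2 = r_squared(old_counts, new_counts)
print(f"slope = {slope:.3f}, R2 = {r2:.4f}")
```

With real data, counts that span several orders of magnitude are
dominated by the largest queries, so it may be worth repeating the check
on the logarithms of the counts.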

The solution is, as Adam says, to build our own open engine, and I am
deeply convinced that such a project is one of the highest priorities
for our community.

--j
http://aixtal.blogspot.com


ps: It's probably off-topic on this list, but I find it extremely scary
that our access to the world's information goes through the bottleneck of
not even a handful of extremely opaque search engines. Beyond counts, they
can simply decide what we see and what we don't. Big Brother feeling.


