[Corpora-List] WebCorp counts

Jean Veronis Jean.Veronis at up.univ-mrs.fr
Wed Apr 27 12:04:43 UTC 2005


Antoinette Renouf wrote:

>Problems with Google counts were discussed recently on this list: http://torvald.aksis.uib.no/corpora/2005-1/0191.html
>
>
Right, and unfortunately, despite major turbulence since February
(indicating major software and database changes), Google's counts are
still completely messed up.

Just an example from a few minutes ago:

94,200,000 for bush
85,600,000 for bush OR corpora

George Boole "doit se retourner dans sa tombe", as we say. I don't know
how "turning in his grave" translates into English, but you get the
picture. I have no financial links with Yahoo, but I would like to point
out that I've switched to Yahoo Search for all my linguistic work, and
their hit counts seem quite reliable (I don't mean true or honest,
simply that they seem correlated with some kind of corpus reality).
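
The Boolean problem with the two Google figures above can be checked
mechanically: every page matching "bush" also matches "bush OR corpora",
so the OR count can never be smaller. A minimal sketch in Python; the
function name is made up for illustration, and the only data are the two
counts quoted in this message:

```python
def or_count_is_consistent(count_a, count_a_or_b):
    """Return True if the hit count for "A OR B" is at least the count for A.

    Any page that matches A also matches "A OR B", so a smaller OR count
    violates basic Boolean logic.
    """
    return count_a_or_b >= count_a


# Counts quoted in this message: "bush" versus "bush OR corpora".
bush = 94_200_000
bush_or_corpora = 85_600_000

print(or_count_is_consistent(bush, bush_or_corpora))
```

With the figures quoted here the check fails, which is exactly the
complaint.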

I agree with Antoinette that hit counts are not the same as word counts,
but they are still usable in many studies, for instance when you compare
term frequency between subsets of the Web, in which you can assume (more
or less safely) that the average document length is comparable. If you
read French, you can find an example of this in my post from this morning
about Yes and No in pages related to the European Constitution:

http://aixtal.blogspot.com/2005/04/web-cest-plutt-non.html
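
To make the comparison between subsets concrete, here is a rough sketch
in Python. It assumes you have, for each subset, the hit count for the
term and the hit count for the query that defines the subset, and that
average document length is comparable across subsets. The function name
and the numbers in the usage line are hypothetical, not taken from any
real query:

```python
def relative_hit_ratio(hits_a, size_a, hits_b, size_b):
    """Compare how often a term turns up in subset A versus subset B.

    hits_x: hit count for the term inside subset X
    size_x: hit count for the query that defines subset X
    Returns (hits_a / size_a) / (hits_b / size_b). A value above 1 means
    the term is relatively more frequent in subset A.
    """
    if size_a <= 0 or size_b <= 0:
        raise ValueError("subset sizes must be positive")
    if hits_b <= 0:
        raise ValueError("hits_b must be positive to form a ratio")
    # Cross-multiply first so that integer inputs stay exact until the
    # final division.
    return (hits_a * size_b) / (hits_b * size_a)


# Hypothetical counts, for illustration only.
print(relative_hit_ratio(200, 1000, 100, 1000))
```

This only says something about documents containing the term, not about
word frequency as such, which is why the assumption about document
length matters.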

However, the real solution for us would be our own crawler and search
engine, as discussed before.

--j
  http://aixtal.blogspot.com


