[Corpora-List] Re: problems with Google counts

FIDELHOLTZ_DOOCHIN_JAMES_LAWRENCE jfidel at siu.buap.mx
Thu Mar 17 02:25:50 UTC 2005


Hi, Corpora Guys,

Sorry, I don't remember who suggested simply repeating the word in a
Google query to get a supposedly more realistic count of pages containing it
(I had deleted all those messages after reading them).  I tried this
yesterday on a couple of Spanish words (eficaz, eficiente).  (By the way,
the results were apparently consistent with a student's search of the
100,000,000-word corpusdelespañol site.)  Anyway, what repeating the word
apparently does is limit the results to those pages which contain the word at
least twice, in this case cutting the counts by roughly 10%.
If that is what is happening, this implies serious problems for relatively
rare words, which may occur twice on very few pages.  At any
rate, the proportional decrease in the page count was about the same
for both words.  (We're talking here about roughly 1.5M
original hits.)  If I'm missing the point of the suggestion, please
straighten me out.
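
To make the hypothesis concrete, here is a minimal sketch of the comparison on a toy collection. It assumes that a repeated query word really does mean "pages with at least two occurrences", which is only my guess about Google's behaviour. The pages and the helper name pages_with_at_least are invented for illustration.

```python
import re

def pages_with_at_least(pages, word, k):
    """Count pages in which `word` occurs at least k times
    (whole-word match, case-insensitive)."""
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    return sum(1 for text in pages if len(pattern.findall(text)) >= k)

# Toy pages, invented for illustration only (not real search results).
pages = [
    "un método eficaz y eficaz otra vez",
    "un remedio eficaz",
    "una solución eficiente",
    "nada que ver",
]

once = pages_with_at_least(pages, "eficaz", 1)   # pages with the word at all
twice = pages_with_at_least(pages, "eficaz", 2)  # pages with it two or more times
drop = 1 - twice / once                          # proportional decrease
```

Run over a real page collection, the same comparison would show how much steeper the drop becomes for rare words, which seldom occur twice on one page.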

Jim

James L. Fidelholtz
Posgrado en Ciencias del Lenguaje, ICSyH
Benemérita Universidad Autónoma de Puebla     MÉXICO


