[Corpora-List] problems with Google counts

Nancy Ide ide at cs.vassar.edu
Wed Mar 16 18:48:38 UTC 2005


Are people aware of the Linguist's Search Engine, developed at the
University of Maryland for doing linguistic searches on Internet data?
The URL is http://lse.umiacs.umd.edu

On Mar 16, 2005, at 1:26 PM, Ring Low wrote:

> A few years ago I did a study of the uses of the definite article THE
> in English using Google search (the data was collected in 2003).  I
> used an Internet search engine to conduct the study partly because I
> wanted to get page counts, which exclude repeated instances in the
> same text (i.e., rather than absolute frequencies).
> I gathered about 1500 nouns and put them into the search engine using
> two strings, "the * N" and "the N".  I also did the same for other
> pre-nominal elements such as "a", "this", "that", "my", "his", "her".
> The other criteria I used at that time were "in text only" and "English
> only".
>
> The inconsistency I found, at that time, was that the sum of the
> frequencies I obtained for all the nouns with one element was always
> much larger than the frequency reported in a single search for that
> element, i.e., the sum of all "the N" counts was much larger than the
> count for the word "the" alone in the Google database, which did
> puzzle me.
>
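
One contributor to a gap like this may simply be how page counts add
up: a page containing both "the dog" and "the cat" is counted once for
"the" but once for each phrase, so the sum over nouns can exceed the
single-word count even when every individual count is exact. A minimal
sketch, assuming a made-up three-page corpus (the pages, the noun list,
and the page_count helper are all hypothetical, and whitespace
tokenization stands in for a real phrase search):

```python
# Toy illustration: page counts (document frequencies) for "the N" can sum
# to more than the page count for "the" alone. All data here is invented.

pages = [
    "the dog chased the cat",
    "the cat sat on the mat",
    "a dog barked",
]
nouns = ["dog", "cat", "mat"]


def page_count(phrase, docs):
    """Number of documents whose token sequence contains the phrase."""
    target = phrase.split()
    n = len(target)
    hits = 0
    for doc in docs:
        tokens = doc.split()
        # Count each document at most once, however often the phrase occurs.
        if any(tokens[i:i + n] == target for i in range(len(tokens) - n + 1)):
            hits += 1
    return hits


the_pages = page_count("the", pages)
sum_the_n = sum(page_count(f"the {noun}", pages) for noun in nouns)

print("pages containing 'the':", the_pages)
print("sum of page counts for 'the N':", sum_the_n)
```

This does not rule out problems in the engine's own estimates; it only
shows that the two figures are not directly comparable.
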
> On the other hand, I did find some consistencies in the data.  First,
> the ratios of the frequencies among the searches were always about the
> same, even though I repeated all the searches a couple of times over
> several months.  In addition, the relative frequencies among the nouns
> at that time, as far as I could check, were consistent with the data I
> found in some other corpora (e.g., if one finds that a word is
> of relatively high frequency in Google, one would also find that
> word having a relatively high frequency in other texts).
> I agree that using Google to conduct linguistic studies has gotten
> more and more difficult since then, as the design of the search engine
> has been changing for commercial reasons.  We do need a search
> engine designed specifically for linguistic studies.  On the other hand,
> until such a search engine is available, some other ways to avoid
> problematic results might be to adjust the design of the study
> according to known weaknesses of the engine and to cross-check
> the results manually against traditional corpora and other search
> engines.
>
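
The cross-check suggested above can be made a little more systematic by
comparing rankings rather than raw counts. The following is only a
sketch, assuming invented counts for five nouns; Spearman's rank
correlation is my choice of measure here, not something used in the
study described, and the helper names are hypothetical:

```python
# Sketch: compare noun rankings from web page counts with rankings from a
# reference corpus using Spearman's rank correlation. All counts are invented.

def ranks(values):
    """Rank values from 1 (smallest) upward, averaging ranks over ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def pearson(xs, ys):
    """Pearson correlation; assumes neither list is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def spearman(xs, ys):
    """Spearman's rho: Pearson correlation of the two rank vectors."""
    return pearson(ranks(xs), ranks(ys))


# Hypothetical counts: web page counts vs. frequencies in a reference corpus.
web_counts = {"time": 9_000_000, "house": 4_000_000, "river": 900_000,
              "lantern": 50_000, "quill": 8_000}
corpus_counts = {"time": 180_000, "house": 60_000, "river": 11_000,
                 "lantern": 700, "quill": 900}

words = sorted(web_counts)
rho = spearman([web_counts[w] for w in words],
               [corpus_counts[w] for w in words])
print(f"Spearman rho = {rho:.2f}")
```

A value near 1 would mean the two sources order the nouns in nearly the
same way, even if the absolute counts themselves cannot be trusted.
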
>
>
> --
> ==============================
> Ring Low
> mlow at acsu.buffalo.edu
> http://www.acsu.buffalo.edu/~mlow/
> ==============================
>
>
>
> Lillian Lee wrote:
>
>> Dear list members,
>>
>> You might be interested to know that until approximately March 8th,
>> Google counts appear to have been quite off (inflation rates of a
>> factor of 66%?), according to Jean Veronis.
>>
>> In a blog post of February 8th
>> ( http://aixtal.blogspot.com/2005/02/web-googles-missing-pages-mystery.html ),
>> Veronis summarized his earlier findings:
>>
>>  # If you type Chirac OR Sarkozy, you get half the number of results
>>    of Chirac alone, which may have a political explanation... but is a
>>    weird approach to boolean logic.
>>
>>  # If you search for "the" in the English pages, you get 1% of the
>>    number you get for all languages together. Does this mean that
>>    "the" is 99 times more frequent in languages other than English?
>>    Of course not.
>>
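
The first of these findings is a violation of a bound that any exact
hit count has to satisfy, so it is easy to test for mechanically. A
minimal sketch, with invented numbers that only mimic the pattern
described (the function name is hypothetical, and the counts are not
figures from the blog post):

```python
# Sketch: a sanity check that exact hit counts for an OR query must pass.
# The numbers below are invented to mimic the pattern described above.

def or_count_is_consistent(count_a, count_b, count_a_or_b):
    """A OR B must match at least as many pages as either term alone,
    and at most as many as the two terms' counts added together."""
    return max(count_a, count_b) <= count_a_or_b <= count_a + count_b


# Hypothetical reported counts in which the OR query returns half as many
# results as one of its terms does on its own.
chirac, sarkozy, either = 3_000_000, 1_500_000, 1_500_000
print(or_count_is_consistent(chirac, sarkozy, either))
```

The second finding is different in kind: a count for English pages that
is 1% of the all-language count is not logically impossible, only
implausible, so it cannot be caught by a bound of this sort.
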
>> He gave a possible explanation and noted that "if you want to know the
>> real index count for any word, simply type it twice".
>>
>> On March 13th, he noted that the counts seem to have been adjusted,
>> that is "changed in a major way":
>> http://aixtal.blogspot.com/2005/03/web-google-adjusts-its-counts.html
>>
>> Related posts indicate problems with MSN, the possibility that Yahoo
>> indexes more pages than Google, and more details on calculations.
>> ________________________________________________________________
>> Lillian Lee, Assoc. Prof.    tel: 607-255-8119
>> Dept of Computer Science     fax: 607-255-4428
>> Cornell University           llee at cs.cornell.edu
>> Ithaca, NY 14853-7501 USA    www.cs.cornell.edu/home/llee
>> ________________________________________________________________
>>
=======================================================

Nancy Ide

Professor of Computer Science
Vassar College
Poughkeepsie, NY 12604-0520 USA
Tel: +1 845 437-5988 Fax: +1 845 437-7498
ide at cs.vassar.edu

Chercheur Associe
Equipe Langue et Dialogue, LORIA/CNRS
Campus Scientifique - BP 239
54506 Vandoeuvre-les-Nancy FRANCE
Tel: +33 (0)3 83 59 20 47 Fax: +33 (0)3 83 41 30 79
ide at loria.fr

=======================================================