[Corpora-List] Re: problems with Google

Tom Emerson tree at basistech.com
Sat Mar 19 20:14:38 UTC 2005


Pascal Soucy writes:
> Google does that with all stopwords. If you search for:
> what does "the" "the" mean, you'll get the same behavior. Google ignores
> stopwords (and * seems to be managed as a stopword).

Not really. Two identical stopwords in succession are kept. Try a
search for "The The" (a band from the late '80s) and you will get hits
on the determiner usage in isolation. You also get different hits for
a search of simply "the".
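
To make the behaviour I'm describing concrete, here is a rough sketch in
Python. It is only a guess at the observable behaviour, not how Google
actually does it: the stopword list, the function name filter_query, and
the rule that an all-stopword query is kept whole are all my own
assumptions for illustration.

```python
# Hypothetical stopword list, for illustration only.
STOPWORDS = {"the", "a", "an", "of", "and"}


def filter_query(terms):
    """Return (position, term) pairs for the terms that survive filtering.

    Assumed rules (not confirmed by any search engine documentation):
      * a stopword standing alone between other terms is dropped;
      * a stopword adjacent to an identical stopword is kept;
      * if every term would be dropped, the whole query is kept.
    Original positions are preserved so a ranker could use proximity.
    """
    lowered = [t.lower() for t in terms]
    kept = []
    for i, term in enumerate(lowered):
        if term not in STOPWORDS:
            kept.append((i, term))
            continue
        prev_same = i > 0 and lowered[i - 1] == term
        next_same = i + 1 < len(lowered) and lowered[i + 1] == term
        if prev_same or next_same:
            kept.append((i, term))
    # Assumption: a query made only of isolated stopwords is not emptied.
    if not kept:
        kept = list(enumerate(lowered))
    return kept
```

With these rules, filter_query(["The", "The"]) keeps both terms, while
filter_query("what does the word mean".split()) drops the lone "the" but
leaves a gap at its position.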

    -tree

> Both the queries:
>
> what does "*" mean
>
> and
>
> what does "*" "*" mean
>
> result in about the same list of documents. The difference between the two
> occurs in the ranking process. The ranking algorithm likely uses term proximity
> to better match the query as it is written, and it keeps the position of
> stopwords in the query to do that.
>
> Pascal Soucy
> Coveo
>
> Selon John Milton <lcjohn at ust.hk>, 17.03.2005:
>
> > I just discovered that Google seems to have retained some use of the
> > wildcard for words if you use double quotes with the asterisk. A search
> > for "what does "*" mean" and "what does "*" "*" mean" results MAINLY in
> > any one and two words respectively. If anyone else is using web searches
> > as language learning/teaching resources, this also looks promising:
> > http://www.findforward.com/
> >
> > John Milton
> > Hong Kong University of Science & Technology

--
Tom Emerson                                          Basis Technology Corp.
Software Architect                                 http://www.basistech.com
  "Beware the lollipop of mediocrity: lick it once and you suck forever"


