[Corpora-List] Google "region"-based searches

Adam Kilgarriff adam at lexmasterclass.com
Wed Nov 28 10:15:21 UTC 2012


Googleology is bad science.  Being at the mercy of every slight change in
Google's unpublished, undocumented search syntax, or in how Google
interprets it, is horrible.  We need to move to more robust approaches
that depend less on Google.  If you have a web-scale corpus on your own
machine, you don't need Google.  We have recently encoded the English
ClueWeb corpus (70 billion words) in the Sketch Engine; see the LREC 2012
paper <http://www.lrec-conf.org/proceedings/lrec2012/pdf/1047_Paper.pdf>.
(Work supported by the EU PRESEMT project.)  Others can take the same
data from Carnegie Mellon and use our procedures and scripts to build
this dataset for themselves.  Access to our version is also a possibility.
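
To illustrate why a local corpus removes the dependence on a search
engine's heuristics, here is a minimal sketch in Python.  It is not the
Sketch Engine's interface or anything shipped with ClueWeb: the function
names (tokenize, contains_phrase, count_docs) and the four toy documents
are assumptions made up for the example.  It counts the documents that
contain an exact phrase plus every required word, so adding a term can
only keep the count the same or lower it.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens (deliberately crude)."""
    return re.findall(r"\w+", text.lower())

def contains_phrase(tokens, phrase_tokens):
    """True if phrase_tokens occurs as a contiguous run inside tokens."""
    n = len(phrase_tokens)
    return any(tokens[i:i + n] == phrase_tokens
               for i in range(len(tokens) - n + 1))

def count_docs(docs, phrase, required_words=()):
    """Count documents containing the exact phrase and all required words."""
    phrase_tokens = tokenize(phrase)
    required = [w.lower() for w in required_words]
    count = 0
    for doc in docs:
        tokens = tokenize(doc)
        vocab = set(tokens)
        if contains_phrase(tokens, phrase_tokens) and all(w in vocab for w in required):
            count += 1
    return count

# Hypothetical toy corpus, standing in for a real web-scale one.
docs = [
    "An enterprise integration pattern implemented in Java.",
    "Each enterprise integration pattern here uses Java and SQL.",
    "Enterprise integration is hard; this pattern helps with SQL.",
    "A Java and SQL tutorial.",
]

phrase = "enterprise integration pattern"
print(count_docs(docs, phrase))                   # 2
print(count_docs(docs, phrase, ["sql"]))          # 1
print(count_docs(docs, phrase, ["java"]))         # 2
print(count_docs(docs, phrase, ["java", "sql"]))  # 1
```

A real setup would use an index rather than a linear scan, but the point
stands: the query semantics are explicit and the counts are repeatable.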

Adam

On 28 November 2012 09:49, Tristan Miller <miller at ukp.informatik.tu-darmstadt.de> wrote:

> Greetings.
>
> On 28/11/12 12:00 AM, John F Sowa wrote:
> > In ancient times (pre 21st century), Google supported Boolean
> > expressions for searching.  But now it's impossible to control
> > their search in any predictable fashion.
> >
> > For example, I wanted to count the number of web pages that used
> > the phrase "enterprise integration pattern" and the word 'sql'.
> >
> > But when I type just "enterprise integration pattern" by itself,
> > I get 114,000 hits.  When I add another word, the number should
> > decrease.  But the following combination gets 137,000 hits:
> >
> >    "enterprise integration pattern" sql
> >
> > The following combination gets 274,000 hits:
> >
> >    "enterprise integration pattern" java
> >
> > And the following gets 25,900,000 hits:
> >
> >    "enterprise integration pattern" java sql
> >
> > I get the same numbers with a one-line search or with
> > their so-called advanced search.
> >
> > Does anybody know how to bypass the Google heuristics and
> > force it to use a simple regular expression for searching?
>
> Google used to support a "+" modifier for search terms; this instructed
> the search to return only those pages which include the search terms.
> (Without the modifier, Google was free to disregard the search terms at
> its discretion.)  The "+" modifier was dropped, probably for marketing
> reasons, once Google+ was introduced.  Supposedly you can now achieve
> the same effect by putting the "required" terms in quotation marks, and
> in my experience this works most of the time.  For your examples, it
> appears that sometimes it does and sometimes it doesn't:
>
>    "enterprise integration pattern"
>
> gets 117,000 hits, but oddly both
>
>    "enterprise integration pattern" sql
>
> and
>
>    "enterprise integration pattern" "sql"
>
> get 137,000 results.  On the other hand,
>
>    "enterprise integration pattern" java sql
>
> gets 25,800,000 results, but
>
>    "enterprise integration pattern" "java" "sql"
>
> returns a more sensible 8520 results.
>
> Regards,
> Tristan
>
> --
> Tristan Miller, Doctoral Researcher
> Ubiquitous Knowledge Processing Lab (UKP-TUDA)
> Department of Computer Science, Technische Universität Darmstadt
> Tel: +49 6151 16 6166 | Web: http://www.ukp.tu-darmstadt.de/
>
>
> _______________________________________________
> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
>
>
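
The inconsistency in the hit counts quoted above can be stated as a
simple check: if every term is required, a query whose terms are a
superset of another's should never report more hits.  Below is a minimal
Python sketch of that check, using the figures John Sowa reported.  The
function name monotonicity_violations and the representation of each
query as a set of terms are assumptions made for illustration.

```python
from itertools import permutations

PHRASE = '"enterprise integration pattern"'

# Hit counts as reported in John Sowa's message; each key is a set of query terms.
reported = {
    frozenset({PHRASE}): 114_000,
    frozenset({PHRASE, "sql"}): 137_000,
    frozenset({PHRASE, "java"}): 274_000,
    frozenset({PHRASE, "java", "sql"}): 25_900_000,
}

def monotonicity_violations(counts):
    """Return (broad, narrow) query pairs where the narrower query has more hits.

    A query is narrower than another if its terms are a proper superset.
    """
    violations = []
    for narrow, broad in permutations(counts, 2):
        if broad < narrow and counts[narrow] > counts[broad]:
            violations.append((broad, narrow))
    return violations

for broad, narrow in monotonicity_violations(reported):
    print(sorted(broad), reported[broad], "<", sorted(narrow), reported[narrow])
```

With these figures, every pair in which one query extends the other is a
violation, which is what one would expect if the reported numbers are
estimates rather than exact counts of matching pages.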


-- 
========================================
Adam Kilgarriff <http://www.kilgarriff.co.uk/>
adam at lexmasterclass.com
Director, Lexical Computing Ltd <http://www.sketchengine.co.uk/>
Visiting Research Fellow, University of Leeds <http://leeds.ac.uk>

*Corpora for all* with the Sketch Engine <http://www.sketchengine.co.uk>
*DANTE: a lexical database for English* <http://www.webdante.com>
========================================