[Corpora-List] Query on the use of Google for corpus research

Fri May 27 12:14:09 UTC 2005

Hello,

I would recommend looking at the following reference, as it is closely
related:
Craig Silverstein, Monika Henzinger, Hannes Marais, and Michael Moricz.
Analysis of a Very Large AltaVista Query Log. Technical Note 1998-014,
Digital SRC, 1998.
http://gatekeeper.dec.com/pub/DEC/SRC/technicalnotes/abstracts/src-tn-1998-014.html

There are some interesting issues with regard to examining such data.
The first that comes to mind is that you have to be able to
distinguish between search sessions. This is non-trivial, as users
typically do not have a single goal when searching; there is some work
by Spink on this topic. Gathering the query data on the client side
and gathering it on the server side each come with their own set of
problems.

As statistics are being gathered, it is important to discuss the
properties of the user group (sample population) being evaluated. The
diversity of the sample (or lack of it) will determine what kind of
conclusions can be drawn.

Hope that helps,

Chris

Peter K Tan wrote:

> Just forwarding a question from a colleague. Would be grateful for
> comments.
>
> Cheers,
> Peter
>
>     From: Michelle Maria Lazar
>     Sent: 27 May 2005 11.27
>     To: Peter K W Tan; Talib, I S; Vincent Ooi; Wee Hock Ann, Lionel
>     Subject: Query on the use of Google for corpus research
>
>     Hi all,
>
>     Someone has written to ask me whether there's any foreseeable
>     problem/objection in using Google to gather statistical evidence
>     on particular language usage, using key word searches. It involves
>     a submission of an article currently under review. Does anyone
>     have any experience/insight on this?
>
>     Cheers,
>
>     Michelle
>