[Corpora-List] Query on the use of Google for corpus research

Dominic Widdows widdows at maya.com
Fri May 27 13:46:28 UTC 2005


>> Does anyone have any
>>    experience/insight on this?
>>
>
> Well... yes! I made a series of in-depth analyses of Google counts.
> They are totally bogus, and unusable for any kind of serious research.
> There is a summary here :
> http://aixtal.blogspot.com/2005/02/web-googles-missing-pages-mystery.html

Dear All,

While I agree with the points made in Jean's excellent summary, I think
it's fair to point out that this was partly motivated by the way
researchers had been using "Google counts" more and more, and coming up
with more and more problems.  As a community of researchers and
peer-reviewers, I still don't think that we've been able to agree on
best practices. I have come across reviews on both sides of the fence,
saying on the one hand:

1. Your method didn't get a very big yield on your fixed corpus, why
didn't you use the Web?

or on the other:

2. Your use of web search engines to get results is unreliable, you
should have used a fixed corpus.

The main problem is that "using the Web" on a large scale puts you at
the mercy of the commercial search engines, which leads to the grim
mess that Jean documents, especially with Google. This situation may
hopefully change as WebCorp (http://www.webcorp.org.uk/) teams up with
a dedicated search engine. In the meantime, it's clearly true that you
can get more results from the web, but you can't vouch for them
properly, and so a community that values both recall and precision is
left reeling.

At the same time, the fact that you can use search engines to get a
rough count of language use in many cases has thrown the door open to a
lot of researchers who have every reason to be interested in language
as a form of data, but have never tried doing much language processing
before. Over the decades, linguists have often been very sniffy about
researchers from other disciplines muscling in on their turf, but this
often results in articles that talk about language just getting
published elsewhere (e.g. in more mainstream media), where the
reviewers are perhaps more favourable. A recent and typical example may
be the "Google Distance" hype
(http://www.newscientist.com/article.ns?id=dn6924) - we've had
conceptual distance, latent semantic analysis, mutual information, etc.
for decades, then a couple of mathematicians come along and call something
the "Google distance", and the New Scientist magazine concludes that
the magic of Google has made machines more intelligent.

All right, there's a trace of bitterness here, I wouldn't mind being in
New Scientist for computing semantic distances, but there's a more
serious danger as well - we've been doing a lot of pretty good work for
a long while in different areas of corpus and computational
linguistics, and it would be a shame if other folks went off and
reinvented everything, just because there are more widely available
tools that enable a wider community to "give it a go" and come up with
something that may do pretty well, especially if you're going for
recall. It breaks some fundamental principles such as "do your
experiments in a way that others can replicate them", but this is
naturally on the increase as big-dataset empiricism comes to the
forefront of many scientific problems. For example, there's the recent
research on ocean temperatures that made 7 million temperature readings
at different stations: none of us can go and replicate that data
collection, but that doesn't invalidate the research.

If we just tell people that search-engine based research is bogus,
people will just keep doing it and publishing it elsewhere, and who
knows, in 10 years' time someone using Google or Yahoo counts may invent
part-of-speech tagging, and that will be another amazing thing that
makes computers more intelligent.

Sorry, I haven't got any answers, but I'm writing this in the hope that
someone else on the list has!
Best wishes,
Dominic


