[Corpora-List] Query on the use of Google for corpus research

Mon May 30 14:29:26 UTC 2005

Dominic Widdows said:
>
> The main problem is that "using the Web" on a large scale puts you at
> the mercy of the commercial search engines, which leads to the grim
> mess that Jean documents, especially with Google.

Actually, I don't think it's really true anymore that large-scale corpus
extraction from the Web necessarily puts you at the mercy of commercial
search engines. It's no longer very difficult to throw together a software
agent that will crawl the Web directly. (In other words: the indexing
part of a commercial search engine may be rocket science, but the
harvesting part is not.)
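
To give a rough idea of how little the harvesting part takes, here is a
minimal sketch in Python of a breadth-first crawler that uses only the
standard library. The names (crawl, LinkParser, fetch_http, max_pages)
are made up for illustration and do not come from any existing tool. The
fetch function is passed in as a parameter so that the crawl logic can be
tried out without touching the network.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects href values from <a> tags and the text content of a page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_data(self, data):
        # Note: this also picks up script/style content; a real harvester
        # would want to strip that out.
        self.text_parts.append(data)


def fetch_http(url):
    """Default fetcher (an assumption): download and decode leniently."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(seeds, fetch=fetch_http, max_pages=100):
    """Breadth-first crawl from the seed URLs; returns {url: page text}."""
    queue = deque(seeds)
    seen = set(seeds)
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            html = fetch(url)
        except Exception:
            continue  # skip pages that cannot be retrieved
        parser = LinkParser()
        parser.feed(html)
        # Collapse runs of whitespace in the extracted text.
        pages[url] = " ".join(" ".join(parser.text_parts).split())
        for href in parser.links:
            # Resolve relative links and drop any #fragment.
            link, _fragment = urldefrag(urljoin(url, href))
            if link.startswith(("http://", "https://")) and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

Something like crawl(["http://www.example.org/"], max_pages=50) would
return a dictionary of page texts keyed by URL. The sketch leaves out
everything a well-behaved agent needs in practice, such as honoring
robots.txt, pausing between requests to the same host, and detecting
duplicate or non-text content.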

> This situation may hopefully change as WebCorp
> (http://www.webcorp.org.uk/) teams up with
> a dedicated search engine. In the meantime, it's clearly true that you
> can get more results from the web, but you can't vouch for them
> properly, and so a community that values both recall and precision is
> left reeling.

I think that if you describe your harvesting procedure accurately (what
you seeded it with, and what filters you used if any), and monitor and
report on a variety of statistical parameters as your corpus is growing,
there's no reason why the resulting data wouldn't serve as an adequate
sample for many purposes -- assuming that's what you meant by "vouch for
them properly".
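
The monitoring part could be sketched along the following lines, again in
Python. The particular statistics (token count, type count, type/token
ratio, hapax count), the crude tokenizer, and the names growth_report and
keep are all assumptions chosen for illustration; which parameters are
worth reporting depends on what the corpus is for.

```python
import re
from collections import Counter

# Crude tokenizer (an assumption): lowercase runs of letters and apostrophes.
TOKEN_RE = re.compile(r"[a-z']+")


def growth_report(documents, keep=lambda text: True):
    """Yield running corpus statistics after each accepted document.

    `keep` is the filter applied during harvesting; documents for which
    it returns False are skipped and do not affect the statistics.
    """
    counts = Counter()
    accepted = 0
    for text in documents:
        if not keep(text):
            continue
        accepted += 1
        counts.update(TOKEN_RE.findall(text.lower()))
        tokens = sum(counts.values())
        types = len(counts)
        yield {
            "documents": accepted,
            "tokens": tokens,
            "types": types,
            "type_token_ratio": types / tokens if tokens else 0.0,
            # Hapax legomena: word types seen exactly once so far.
            "hapax": sum(1 for n in counts.values() if n == 1),
        }
```

Each yielded row is a snapshot of the corpus at that point, so the
sequence of rows can be published alongside the seed list and the filter
definition as a record of how the sample developed.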

-- Mark

Mark P. Line
Polymathix
San Antonio, TX