[Corpora-List] Query on the use of Google for corpus research

Mark P. Line mark at polymathix.com
Mon May 30 14:29:26 UTC 2005


Dominic Widdows said:
>
> The main problem is that "using the Web" on a large scale puts you at
> the mercy of the commercial search engines, which leads to the grim
> mess that Jean documents, especially with Google.

Actually, I don't think it's really true anymore that large-scale corpus
extraction from the Web necessarily puts you at the mercy of commercial
search engines. It's no longer very difficult to throw together a software
agent that will crawl the Web directly. (IOW: The indexing part of
commercial search engines may be rocket science, but the harvesting part
is not.)
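
To give a rough idea of how little is involved, here is a minimal sketch of such a harvesting agent in Python, using only the standard library. It is an illustration rather than a finished tool: the names crawl, fetch_http, keep and max_pages are made up for this example, and a real agent would also need to honor robots.txt, throttle its requests, and handle non-HTML content and character encodings more carefully.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collect the href targets of <a> tags on one page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch_http(url):
    """Default fetcher (hypothetical helper): download and decode leniently."""
    with urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")


def crawl(seeds, fetch=fetch_http, keep=lambda url: True, max_pages=100):
    """Breadth-first crawl from the seed URLs; returns {url: html}.

    `keep` is the filter deciding which discovered links to follow.
    """
    queue = deque(seeds)
    seen = set(seeds)
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            html = fetch(url)
        except Exception:
            continue  # skip pages that cannot be retrieved
        pages[url] = html
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop "#fragment" parts.
            link = urldefrag(urljoin(url, href)).url
            if link.startswith("http") and link not in seen and keep(link):
                seen.add(link)
                queue.append(link)
    return pages
```

The fetch function is passed in as a parameter so that the crawl logic can be tried out against a small in-memory set of pages before it is pointed at the live Web.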


> This situation may hopefully change as WebCorp
> (http://www.webcorp.org.uk/) teams up with
> a dedicated search engine. In the meantime, it's clearly true that you
> can get more results from the web, but you can't vouch for them
> properly, and so a community that values both recall and precision is
> left reeling.

I think that if you describe your harvesting procedure accurately (what
you seeded it with, and what filters you used if any), and monitor and
report on a variety of statistical parameters as your corpus is growing,
there's no reason why the resulting data wouldn't serve as an adequate
sample for many purposes -- assuming that's what you meant by "vouch for
them properly".
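
As a sketch of what that monitoring might look like, the Python below records the seeds and filters alongside a few running statistics each time a document is added. The class name CorpusMonitor, the crude \w+ tokenizer, and the particular statistics (token count, type count, type/token ratio, hapax count) are assumptions chosen for illustration; which parameters are worth reporting depends on what the corpus is for.

```python
import re
from collections import Counter

# Crude tokenizer, assumed here for illustration only.
TOKEN_RE = re.compile(r"\w+")


class CorpusMonitor:
    """Track simple growth statistics for a harvested corpus."""

    def __init__(self, seeds, filters):
        self.seeds = list(seeds)      # what the harvest was seeded with
        self.filters = list(filters)  # descriptions of the filters applied
        self.counts = Counter()       # running word-form frequencies
        self.documents = 0
        self.history = []             # one snapshot per added document

    def add(self, text):
        """Add one document and return the updated statistics."""
        self.counts.update(TOKEN_RE.findall(text.lower()))
        self.documents += 1
        n_tokens = sum(self.counts.values())
        n_types = len(self.counts)
        snapshot = {
            "documents": self.documents,
            "tokens": n_tokens,
            "types": n_types,
            "type_token_ratio": n_types / n_tokens if n_tokens else 0.0,
            "hapax": sum(1 for c in self.counts.values() if c == 1),
        }
        self.history.append(snapshot)
        return snapshot
```

Reporting the history together with the seeds and filters shows a reader how the sample changed as it grew, for instance whether the type/token ratio had leveled off by the time harvesting stopped.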


-- Mark

Mark P. Line
Polymathix
San Antonio, TX


