[Corpora-List] Query on the use of Google for corpus research

Dominic Widdows widdows at maya.com
Mon May 30 17:27:59 UTC 2005


Hi Mark,

Thanks for your response; it certainly sounds like a hopeful direction.

> Actually, I don't think it's really true anymore that large-scale corpus
> extraction from the Web necessarily puts you at the mercy of commercial
> search engines. It's no longer very difficult to throw together a
> software agent that will crawl the Web directly.

But is it not quite difficult to "throw something together" that
doesn't cause all sorts of traffic problems? I have always shied away
from actually trying this, under the impression that it's a bit of a
dangerous art, though that impression is certainly due in part to
ignorance.
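As far as I can tell, the basic courtesies are to identify yourself,
consult robots.txt, and rate-limit your requests; the sketch below
(present-day Python, with a placeholder user-agent string and an
arbitrary fixed delay) is presumably the bare minimum, though I would
be glad to be corrected:

import time
import urllib.request
import urllib.robotparser
from urllib.parse import urlparse

USER_AGENT = "corpus-harvester/0.1 (contact: someone@example.org)"  # placeholder
DELAY_SECONDS = 5          # fixed per-fetch pause; real crawlers do this per host
robots_cache = {}          # one robots.txt parser per host

def allowed(url):
    """Consult the host's robots.txt before fetching anything."""
    parts = urlparse(url)
    host = f"{parts.scheme}://{parts.netloc}"
    if host not in robots_cache:
        parser = urllib.robotparser.RobotFileParser(host + "/robots.txt")
        try:
            parser.read()
        except OSError:
            pass   # unreadable robots.txt treated as permissive (a policy choice)
        robots_cache[host] = parser
    return robots_cache[host].can_fetch(USER_AGENT, url)

def fetch(url):
    """Fetch one page politely: identify yourself, obey robots.txt, then wait."""
    if not allowed(url):
        return None
    request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(request) as response:
        page = response.read()
    time.sleep(DELAY_SECONDS)
    return page

Presumably tools like Heritrix add a great deal on top of this (per-host
queues, politeness policies, deduplication), which is partly why I ask
about existing software below.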

> (IOW: The indexing part of commercial search engines may be rocket
> science, but the harvesting part of them is not.)

That's intriguing; as someone who has worked more in indexing, I'd have
said precisely the opposite :-)
Delighted if I'm wrong.

Is there good, reliable software out there for those who would still be
wary of hacking up a harvester themselves?
There is the Internet Archive's Heritrix crawler
(http://crawler.archive.org/). Has anyone used it and found it
suitable for linguistic purposes?

> I think that if you describe your harvesting procedure accurately (what
> you seeded it with, and what filters you used if any), and monitor and
> report on a variety of statistical parameters as your corpus is growing,
> there's no reason why the resulting data wouldn't serve as an adequate
> sample for many purposes -- assuming that's what you meant by "vouch for
> them properly".

Yes, that is part of what I meant. Do we have a good sense of what
these statistical parameters should be? To what extent is there a code
of practice for saying exactly what you did? Again, we run into
standard empiricist questions - following your proposal, one could
guarantee to reproduce the "initial conditions" of someone's
experiment, but one could at best expect similar outcomes.
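To make the question concrete, I imagine a running report along the
lines of the sketch below, though which parameters belong in it is
exactly the open question; the ones here (token count, vocabulary size,
type/token ratio, domain spread) are only illustrative:

import collections
import re
from urllib.parse import urlparse

token_count = 0
vocabulary = set()
domain_counts = collections.Counter()

def report(url, text):
    """Update and print running corpus statistics after each harvested document."""
    global token_count
    tokens = re.findall(r"\w+", text.lower())
    token_count += len(tokens)
    vocabulary.update(tokens)
    domain_counts[urlparse(url).netloc] += 1
    print(f"tokens={token_count}  types={len(vocabulary)}  "
          f"type/token={len(vocabulary) / max(token_count, 1):.4f}  "
          f"top domains={domain_counts.most_common(3)}")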

This still leaves some of the traditional benefits of corpora
unaccounted for - what about normalising the text content (presuming
the traditional notion that text content is the linguistic phenomenon
you're interested in), tagging, perhaps getting all the data into the
same character set, etc.? These are some of the creature comforts that
organizations such as the LDC have traditionally provided. We can
provide adequate descriptions of what was done with the data, and I
feel that we are even pretty good as a community at making the software
we developed available to others (partly for selfish gene and "please
cite my project!" reasons, but those motivations still benefit the
community at large).
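
The character-set point, at least, seems fairly mechanical. Something
like the following sketch (modern Python; the fallback order is just one
possible policy) is roughly what I mean by getting all the data into the
same character set:

import unicodedata

def to_normalised_text(raw_bytes, declared_charset=None):
    """Decode harvested bytes into one canonical form: UTF-8 text, NFC-normalised."""
    for charset in filter(None, [declared_charset, "utf-8"]):
        try:
            text = raw_bytes.decode(charset)
            break
        except (UnicodeDecodeError, LookupError):
            continue
    else:
        # Last resort: Latin-1 never fails to decode, though it may mangle things.
        text = raw_bytes.decode("iso-8859-1")
    return unicodedata.normalize("NFC", text)

Real pages misdeclare their encodings often enough that a proper
detection step would probably be needed in practice.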

However, there is still the problem that the more sophisticated stuff
you throw at your data, the harder it is for anyone to replicate or
extend your results, and I would like to see a system where the data
itself is made available as a standard part of practice. Ideally, we
would still work on the same datasets where possible, rather than
duplicating similar datasets for each isolated project. From an
engineering point of view, storage isn't really a problem here, but
bandwidth is - you have to keep the files you've trawled and processed
on disk somewhere, but you might not be able to foot the bill for other
researchers hitting your web server every time they fancy
half-a-billion words of nice corpus data. To my mind, the only real
solution to this part of the problem is going to be breaking your
corpus up into smaller components and enabling other researchers to
search and copy whichever parts they need in a peer-to-peer fashion. I
gave a talk on this idea recently at the AAACL conference
(http://infomap.stanford.edu/papers/distributed-corpora.pdf), but I
guess this is another story really.
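
For what it's worth, the "smaller components" could be as simple as
fixed-size chunks plus a manifest of content hashes for the
peer-to-peer layer to distribute; a rough sketch (the chunk size and
file layout are arbitrary):

import hashlib
import json
import os

CHUNK_BYTES = 64 * 1024 * 1024   # arbitrary component size

def split_corpus(corpus_path, out_dir):
    """Write fixed-size corpus chunks plus a manifest of their SHA-256 digests."""
    os.makedirs(out_dir, exist_ok=True)
    manifest = []
    with open(corpus_path, "rb") as corpus:
        index = 0
        while True:
            chunk = corpus.read(CHUNK_BYTES)
            if not chunk:
                break
            name = f"part-{index:05d}"
            with open(os.path.join(out_dir, name), "wb") as part:
                part.write(chunk)
            manifest.append({"file": name, "sha256": hashlib.sha256(chunk).hexdigest()})
            index += 1
    with open(os.path.join(out_dir, "manifest.json"), "w") as handle:
        json.dump(manifest, handle, indent=2)

Researchers could then fetch and verify only the parts they need from
whoever happens to hold them.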

Best wishes,
Dominic


