[Corpora-List] Query on the use of Google for corpus research
Tom Emerson
tree at basistech.com
Tue May 31 20:54:01 UTC 2005
Mark P. Line writes:
[...]
> But none of this is new, and none of it is going to be much of a problem
> for a researcher who merely wants to capture some sample texts off the
> Web.
We're obviously talking about differences of many orders of
magnitude. When you say "some sample texts off the Web", I assume you
mean a few hundred at most.
[...]
> And you believe that's typical for linguists wishing to capture a research
> corpus from the Web?
Yes, I hope so. Researchers use (or are trying to use) Google to
quantify linguistic phenomena because it (and the other commercial
search engines) has a large body of natural language text to work
with.
If you grab content from only a few dozen sites, then your sample
size is simply too small to make any meaningful statement about the
behavior you are studying. That is one reason the crawls that I do
are so large.
> Again, do you believe that's typical for linguists wishing to capture a
> research corpus from the Web?
Yes. The BNC is 100 million words. The LOB is 1 million words. The
Brown Corpus is 1 million words. The LDC has Chinese, English, and
Arabic gigaword corpora. The UN parallel text corpus has almost 150
million words across three languages.
So yes, I would say that researchers who are looking to build their
own corpora want to crawl at the scale that I am.
Also, as I mentioned before, I fully expect that 40-60% of the
documents I get in my crawls will end up being discarded.
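To give a concrete (if simplified) picture of why so much gets thrown
away, a post-crawl filter as crude as the Python sketch below already
rejects a large share of what a crawler fetches. The criteria and the
threshold here are illustrative guesses on my part, not the filters I
actually run:

    # Crude post-crawl document filter; the criteria and the 200-word
    # threshold are illustrative assumptions only.
    def keep_document(content_type, text):
        if "text/html" not in content_type:
            return False          # drop images, PDFs, etc.
        if len(text.split()) < 200:
            return False          # too little running text to be useful
        return True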
> (You'll note that the subject line of this thread still says something
> about "corpus research". I didn't think this was ever about
> high-performance product development.)
Much of my corpus work doesn't end up directly in our products, FWIW.
> It would be an insignificant burden on leonardo (my Linux machine) to
> track hundreds of millions of URL's if I wanted to.
Undoubtedly so: the machine I'm running my big crawl on can handle
this just fine. But there *is* a cost. Currently the Heritrix state
database for my large crawl weighs in at 88 GB on disk, compared to 43
GB for the compressed content I've downloaded (fortunately HTML
compresses well). I'm currently pulling 2.5 MB/s through the crawler,
a rate capped by our IT staff since without the cap I was consuming
almost all of our available bandwidth. Doing a non-trivial crawl will
use a lot of resources.
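For illustration, here is a minimal Python sketch of the kind of
politeness throttling that keeps a crawler from saturating the
network; the seed URLs and the delay value are placeholders, and this
is not how Heritrix itself is configured:

    # Fetch a list of pages with a fixed delay between requests.
    # Seed URLs and the delay are illustrative placeholders.
    import time
    import urllib.request

    SEEDS = ["http://www.example.com/a.html",
             "http://www.example.com/b.html"]
    DELAY = 2.0  # seconds between requests

    for url in SEEDS:
        try:
            with urllib.request.urlopen(url) as resp:
                html = resp.read()
        except Exception:
            continue              # skip pages that fail to fetch
        # ... hand `html` off to storage/compression ...
        time.sleep(DELAY)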
[...]
> > Because you may be building a synchronic corpus.
>
> I guess I'm going to have to get you to connect the dots for me. How does
> revisiting sites with some regularity help me to build a synchronic corpus
> in a way that I cannot build it if I never revisit any site again?
>
> Or did you mean a _diachronic_ corpus, in the belief that processes of
> language change can usefully be detected by means of periodic scans of
> websites?
Right, I mistyped.
> Why would I ignore their robot exclusion rules? This assumption surprises
> me, since you have expressed concern that readers of this thread might be
> encouraged to do things that webmasters might not like.
I'm not implying that you yourself would, but it is surprising how
often people ask why they can't slurp the entire New York Times or
Washington Post site.
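For what it's worth, honoring the exclusion rules is easy to do
programmatically. A minimal Python sketch, with a placeholder crawler
name and URLs:

    # Check a site's robots.txt before fetching a page.
    # The user-agent string and the URLs are placeholders.
    from urllib import robotparser

    rp = robotparser.RobotFileParser("http://www.example.com/robots.txt")
    rp.read()                     # download and parse the rules

    url = "http://www.example.com/articles/page.html"
    if rp.can_fetch("my-research-crawler", url):
        print("allowed by robots.txt")
    else:
        print("disallowed -- skip it")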
> My point has been that I will not generally *need* more URL's than I can
> crawl at any one time. I'm not updating the Google index. I'm not
> acquiring named entities for an exhaustive lexical database or ontology.
> I'm just collecting enough text to answer certain research questions about
> my target language.
What is enough text?
> Why in the world would I store corpus text as millions of small files,
> even if I were operating at such a large scale (which, again, again, is
> not the typical case I've been advising for here)?
Well, a naive crawler will do just that. Heck, just grab 'wget' and
let it go. You'll mirror the whole site on your disk. Simple.
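A less naive alternative is to append each fetched page to a single
compressed archive instead of writing one file per page. A minimal
Python sketch, using an ad hoc record layout (URL, length, body) of
my own rather than any standard archive format:

    # Append crawled pages to one gzip archive rather than mirroring
    # them as millions of small files.  The record layout is purely
    # illustrative.
    import gzip

    def append_page(archive_path, url, html_bytes):
        with gzip.open(archive_path, "ab") as out:
            out.write(url.encode("utf-8") + b"\n")
            out.write(str(len(html_bytes)).encode("ascii") + b"\n")
            out.write(html_bytes + b"\n")

    # usage: append_page("crawl.gz", page_url, page_content)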
> I think we're starting to see the outlines of a paradigm divide here. :)
I think so.
> Many are happy to have gotten the grant money to acquire anything more
> than an office computer in the first place.
This I have no argument with. It is often the same in industry,
contrary to what many may think. ;-)
Peace,
-tree
--
Tom Emerson Basis Technology Corp.
Software Architect http://www.basistech.com
"Beware the lollipop of mediocrity: lick it once and you suck forever"