[Corpora-List] Query on the use of Google for corpus research

Mark P. Line mark at polymathix.com
Tue May 31 22:00:53 UTC 2005


Tom Emerson said:
>
> We're obviously talking about differences in many orders of
> magnitude. When you say "some sample texts off the Web" I assume you
> mean a few hundred at most.

I mean as many as it takes to construct a sample to support the study. A
single sample might be 1 million words or 10 million words.

Obviously, there is a break-even point where it starts making more sense
to use high-performance tools and less sense to roll your own.

My points have been that

- the break-even point is significantly greater than zero and probably on
the order of 10 million words,
- most academic researchers answer most of their questions on corpora that
are significantly smaller than that,
- such a corpus does not need to be web-exhaustive or even domain-exhaustive,
- source diversity is a parameter that depends on your research questions,
- the researcher can carry out any number of sampling iterations until the
sample has the right characteristics to support the research agenda,
- it's perfectly reasonable to expect the research team to do any amount of
eyeballing at any stage of the sampling process, and
- all of this can be done easily and safely with relatively simple,
relatively easy-to-construct tools (pretty much with a naked Java
development kit and a database server; a minimal sketch follows below).

I'd like to cite this little article for the second time in this thread:

   http://java.sun.com/developer/technicalArticles/ThirdParty/WebCrawler/

Are we to assume that Sun has done something utterly unspeakable by
suggesting that a Java developer might have reason to sit down and build
her own web crawler, and by showing her how?


> Researchers use (or are trying to use) Google to
> quantify linguistic phenomena because it (and the other commercial
> search engines) has a large body of natural language text to work
> with.

Yes, that's where I came into this thread: somebody expressed concern
that the construction of web corpora depends on search engines. I replied
that it's possible to use a crawler to harvest texts from the web without
using a search engine at all, and that it's not very difficult to build
your own crawler to do just what you need.

I continue to advise against the use of Google hitcounts to quantify
linguistic phenomena in anything but a grossly informal and exploratory
way. (Is "modeling" more frequent than "modelling"? Does the same hold for
"traveling" and "travelling"?)


> If you grab content from a few dozen sites then your sample size is
> simply too small to make any meaningful statement about the behavior
> you are studying.

What behavior am I studying, and how big is the sample I acquired from the
few dozen sites, in number of words?


>> > Because you may be building a synchronic corpus.
>>
>> I guess I'm going to have to get you to connect the dots for me. How
>> does
>> revisiting sites with some regularity help me to build a synchronic
>> corpus
>> in a way that I cannot build it if I never revisit any site again?
>>
>> Or did you mean a _diachronic_ corpus, in the belief that processes of
>> language change can usefully be detected by means of periodic scans of
>> websites?
>
> Right, I mistyped.

Okay. I doubt that very much could be said about language change by
revisiting websites to track text revisions in them, but if somebody
wanted to try, I don't see that it would be much of a problem for a
home-grown crawler.
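
For the record, here is roughly what the revisiting would look like on
top of a home-grown crawler: a fixed list of pages, a schedule, and a
timestamped snapshot per visit. The watched URL, the weekly interval and
the print-instead-of-store step are placeholders:

import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.util.Arrays;
import java.util.Date;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class RevisitScheduler {
    public static void main(String[] args) {
        List<String> watched = Arrays.asList("http://example.org/");  // pages to track (placeholder)
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        timer.scheduleAtFixedRate(() -> {
            for (String url : watched) {
                try (InputStream in = new URL(url).openStream()) {
                    byte[] snapshot = in.readAllBytes();
                    // A real run would store the snapshot with its timestamp
                    // (e.g. one database row per URL and fetch date) for later comparison.
                    System.out.println(new Date() + "  " + url + "  "
                                       + snapshot.length + " bytes");
                } catch (IOException e) {
                    System.err.println("skipped " + url + ": " + e.getMessage());
                }
            }
        }, 0, 7, TimeUnit.DAYS);                                      // revisit weekly
    }
}

Whether the resulting snapshots say anything about language change is, as
I said, a different question.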


>> My point has been that I will not generally *need* more URL's than I can
>> crawl at any one time. I'm not updating the Google index. I'm not
>> acquiring named entities for an exhaustive lexical database or ontology.
>> I'm just collecting enough text to answer certain research questions
>> about my target language.
>
> What is enough text?

What's my research question?


>> Why in the world would I store corpus text as millions of small files,
>> even if I were operating at such a large scale (which, again, again, is
>> not the typical case I've been advising for here)?
>
> Well, a naive crawler will do just that.

So, you're saying that nobody who builds their own crawler is going to
have a clue about any more sophisticated means of data management than
dropping millions of small files into the file system.

Why do you say that?
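
The obvious alternative is to keep each fetched document as one row in
the same database server the crawler already talks to. A sketch of one
possible table, assuming PostgreSQL and the same placeholder connection
details as in the crawler sketch earlier in this message:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;

public class CorpusSchema {
    public static void main(String[] args) throws SQLException {
        // Placeholder connection details; any JDBC-capable server works the same way.
        try (Connection db = DriverManager.getConnection(
                 "jdbc:postgresql://localhost/corpus", "user", "pw");
             Statement s = db.createStatement()) {
            s.execute("CREATE TABLE IF NOT EXISTS pages ("
                    + " id      SERIAL PRIMARY KEY,"
                    + " url     TEXT NOT NULL,"
                    + " fetched TIMESTAMP NOT NULL DEFAULT now(),"
                    + " html    TEXT NOT NULL)");
        }
    }
}

That's one CREATE TABLE statement, not a data management research
program.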


> Heck, just grab 'wget' and
> let it go. You'll mirror the whole site on your disk. Simple.

You already accepted in an earlier post that corpus linguists do *not*
typically need scalable, high-performance crawlers to capture web corpora
safely. So what's the need for hyperbole here?


-- Mark

Mark P. Line
Polymathix
San Antonio, TX


