[Corpora-List] web-corpora, big and small

Tue May 31 23:20:25 UTC 2005

Did I mention the Corpus Linguistics 2005 Web-as-Corpus workshop? ;-)

http://sslmit.unibo.it/~baroni/web_as_corpus_cl05.html

> > http://spidrs.must.dye.notttttttt/ [obfuscated]
>
> So, you've just inserted a link to a spider trap into the Corpora-List
> archive?

I'll tell you more: there is a link to it from the FAQ of Heritrix,
probably the most popular publicly available crawler. We do this so that
the simple-minded crawlers written by naive Java developers are doomed.
;-)

> > Moreover, as spammers are getting smarter all the time, anti-spammers are
> > also becoming more sophisticated -- suppose that somebody built a spider
> > track by generating random _sentences_ instead of words: that would be
> > very hard to detect...
>
> Can you show me a list of random sentences that can fool any native
> speaker into believing it's a valid text?

Suppose I eyeball a random sample of my data by hand. I estimate that 20%
of what I collected is random sentences from various spider traps. Then,
I'll still need to identify that 20% in some automated way in order to
discard it (of course, automation is needed only if my corpus is big, but
size is undoubtedly one of the reasons why "some" corpus linguists -- the
evil ones, of course -- are attracted by the web).

> You have to get away from the high-tech product development paradigm of
> "by human hands untouched" to the scruffy, underfunded, underpowered
> paradigm in which undergraduate interns eyeball the results of each
> night's run to see if anything obviously bogus came through.

In my experience, humans cost more than machines, and unfortunately I do
not have access to an unlimited supply of undergraduate interns.

> But I'm having more and more difficulty
> understanding why we can't just focus in this thread on the much
> smaller-scale problem actually at hand: on-the-fly capture of sample texts
> for a linguistic research corpus.

By the time I joined this conversation, it was already about spider traps
and such things (but you'll notice I changed the topic just in case).

> But if you tried to sell me an inference from that web sample about the
> distribution of word senses of "spider" in written English, much less
> English full-stop, then I wouldn't be buying: I'd point out the flaw in
> your research design. Such an inference would be over-generalized and
> almost certainly not justified on the basis of your sample, because your
> sample would not have been representative of written English, much less
> English full-stop.

True. But there is a lot of recent empirical work (e.g., by Peter Turney)
indicating that, despite its unrepresentativeness, the web, by its sheer
size, can teach us things about the meaning of English words (English as
in English full-stop) that do not emerge from smaller, carefully balanced
corpora (success is often measured by comparison with human performance).

I think that this has something to do with the fact that, while the
underlying statistical population is "html English", for certain tasks and
purposes html English is similar enough to English-period (or at least
written-English-period) that html English is a good surrogate for
English-period.

That, and the issue of the Zipfian-ness of word frequency distributions in
corpora, which makes BNC-sized corpora too small to even try to use them
for some tasks, so that one has to go for larger data-sets, although they
will typically be neither balanced nor representative, as the BNC is meant to
be.
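
A minimal sketch of the Zipfian point, under assumptions that are not in the
message itself: an idealised Zipf law with exponent 1 over a finite
vocabulary, and hypothetical names (`harmonic`, `types_with_min_freq`). It
counts how many word types are expected to occur at least `min_freq` times
in a corpus of a given size, which is one rough way to see why a corpus of
about 100 million words can be too small for tasks that need many
observations per word.

```python
def harmonic(vocab_size):
    """Normalising constant of a Zipf law with exponent 1 over vocab_size ranks."""
    return sum(1.0 / rank for rank in range(1, vocab_size + 1))


def types_with_min_freq(n_tokens, vocab_size, min_freq):
    """Number of ranks whose expected frequency is at least min_freq.

    Assumption: the expected frequency of the word of rank r is
    n_tokens / (r * H), where H = harmonic(vocab_size). Real corpora
    only approximate this, so treat the result as an illustration.
    """
    h = harmonic(vocab_size)
    # n_tokens / (r * h) >= min_freq  <=>  r <= n_tokens / (min_freq * h)
    max_rank = int(n_tokens / (min_freq * h))
    return min(vocab_size, max_rank)


if __name__ == "__main__":
    vocab = 1_000_000  # hypothetical vocabulary size
    for size in (10**8, 10**9):  # roughly BNC-sized vs. ten times larger
        covered = types_with_min_freq(size, vocab, min_freq=20)
        print(f"{size:>13,} tokens: {covered:,} types expected >= 20 times")
```

Under these assumptions the number of usable types grows roughly linearly
with corpus size until the vocabulary is exhausted, so a tenfold larger
sample covers far more of the long tail.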

And for languages other than English often the web is the only way to
build even BNC-sized corpora...

<dangerous_aside> I am also not so convinced that a language can be
identified with the population of all sentences ever produced (or
currently being produced?)  in that language, in the same way in which I
suppose that in sociology or geography it makes sense to define the
population of, say, Californians as made of all the people living in
California.  Which means that I'm not sure that we are on much more solid
grounds when drawing inferences about "English"  from a good,
old-fashioned balanced corpus... but this is another story.
</dangerous_aside>

> of what qualifies as corpus linguistics may differ from that of others
> with equal or greater exposure to the field, I guess I'm surprised at the
> notion that I wouldn't know corpus linguistics when I see it.

Well, for example I would think that corpus-based ontology building,
lexicon extraction and named entity recognition qualify as legit
activities for corpus linguists, whereas I gather from your replies to Tom
Emerson that you are very confident that a corpus linguist could not
possibly be interested in that.

> By what procedure did you arrive at 1 billion words as your required
> sample size? Why not 500 million or 5 billion?

We (since luckily I am not alone in this:  http://wacky.sslmit.unibo.it --
although what I'm saying here only represents my own interpretation of why
I'm doing this) would like to have as much data as possible, both for
exploratory studies of what the web has to offer to linguists and because
we are interested in seeing how the behaviour of certain methods,
measures, algorithms changes as sample size increases. 1 billion words is
an arbitrary starting point -- chosen to be as big as the largest existing
corpora we are aware of.

> That said, if you do need a corpus that big and you really don't know how
> to build one from web data with the characteristics you need, and you're
> reasonably confident that the characteristics can be achieved with a web
> sample, then there are probably several of us here who could help you. You
> could start a new thread, since that's a very different problem domain
> from the one we've been addressing here -- one that would certainly profit
> from a high-performance off-the-shelf crawler and other components.

I certainly do not hesitate to ask specific questions to this or other,
sometimes more appropriate lists (such as the heritrix crawler list), and
I'm glad that corpus linguists and crawlers are such friendly and helpful
communities. My point was simply that retrieving large-ish corpora from
the web (at least if you want them to be composed of non-duplicate,
natural, connected text) is not a trivial task, as I (mis?)understood you
were implying.
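
To give a flavour of just one part of the "non-duplicate" requirement, here
is a minimal sketch of near-duplicate filtering with word n-gram shingles
and Jaccard similarity. The function names, the shingle length and the 0.8
threshold are all illustrative assumptions, not a description of any
particular crawler or of the pipeline discussed in this thread.

```python
def shingles(text, n=5):
    """Set of overlapping word n-grams (shingles) in a document."""
    tokens = text.lower().split()
    if len(tokens) < n:
        # Short documents are represented by a single shingle (or none).
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def drop_near_duplicates(docs, n=5, threshold=0.8):
    """Keep a document only if it is not too similar to any kept earlier.

    Assumption: threshold and n are arbitrary illustrative values.
    """
    kept, kept_shingles = [], []
    for doc in docs:
        s = shingles(doc, n)
        if all(jaccard(s, other) < threshold for other in kept_shingles):
            kept.append(doc)
            kept_shingles.append(s)
    return kept


if __name__ == "__main__":
    pages = [
        "the cat sat on the mat today",
        "the cat sat on the mat today",
        "completely different text about corpora and crawlers",
    ]
    print(drop_near_duplicates(pages, n=3))
```

The pairwise comparison here is quadratic in the number of documents, so it
only illustrates the idea; at the scale of a billion words one would need
something more scalable, and that is before dealing with boilerplate,
machine-generated text and the like.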

Regards,

Marco