[Corpora-List] web-corpora, big and small

Thu Jun 2 00:39:55 UTC 2005

Marco Baroni said:
>
> I'll tell you more: there is a link to it from the FAQs of Heritrix,
> probably the most popular publicly available crawler. We do this so that
> the simple-minded crawlers written by naive Java developers are doomed.
> ;-)

Isn't it amazing what a difference one little smiley can make. Just
imagine if you'd forgotten to add it: people might think you were
suggesting that somebody here is a naive Java developer who writes
simple-minded crawlers. Fortunately, given the imperviousness of the
smiley hedge, nobody could possibly think you were suggesting that. Thanks
to the adamantine shield of the smiley, everybody will think you are
making a good-humored and well-intentioned joke.

> Well, for example I would think that corpus-based ontology building,
> lexicon extraction and named entity recognition qualify as legit
> activities for corpus linguists, whereas I gather from your replies to
> Tom Emerson that you are very confident that a corpus linguist could not
> possibly be interested in that.

Why would you gather that (other than the fact that you started reading at
mid-thread)? I've said some things about the typical needs of corpus
linguists and about the particular problem domain being discussed in this
thread. How do you get from that to an assertion that no corpus linguist
could possibly be interested in anything else? I've written quite the
opposite more than once in this thread, saying that there's a break-even
point where home-grown tools will have to make way for high-performance
off-the-shelf tools. What's the need for hyperbole here?

>> By what procedure did you arrive at 1 billion words as your required
>> sample size? Why not 500 million or 5 billion?
>
> 1 billion words is an arbitrary starting point -- chosen to be as big as
> the largest existing corpora we are aware of.

So, you give higher priority to being able to show that yours is as big as
anybody else's than to efficient allocation of your time and money?

It's not about how big it is, it's all about how you use it.

> I certainly do not hesitate to ask specific questions to this or other,
> sometimes more appropriate lists (such as the Heritrix crawler list), and
> I'm glad that corpus linguists and crawlers are such friendly and helpful
> communities. My point was simply that retrieving large-ish corpora from
> the web (at least if you want them to be composed of non-duplicate,
> natural, connected text) is not a trivial task, as I (mis?)understood you
> were implying.

Aren't you mincing words here? Okay, I can play.

I have said that it's not difficult to build software to do this, and it's
not. If you disagree and want to debate the point, then you should try to
show why it's difficult. (Showing that it's difficult for _you_ is not
enough: you should show that it's difficult in principle.) Stating that
the task is not trivial does not rebut the claim that it's not difficult,
because lots of tasks (including most everyday software development tasks)
are neither trivial nor difficult.

-- Mark

Mark P. Line
Polymathix
San Antonio, TX