[Corpora-List] Query on the use of Google for corpus research

Mark P. Line mark at polymathix.com
Tue May 31 17:08:10 UTC 2005


Marco Baroni said:
>> > How do you deal with spider traps?
>>
>> Why would spider traps be a concern (apart from knowing to give up on
>> the site if my IP address has been blocked by their spider trap) when
>> all I'm doing is constructing a sample of text data from the Web?
>
> First of all, your crawler has to understand that it fell into a trap.
> Second, some spider traps generate dynamic pages containing random text
> for you to follow -- now, that's a problem if you're trying to build a
> linguistic corpus, isn't it?

It's not much of a problem unless you presuppose that a corpus linguist
would have difficulty finding a way to distinguish between a valid text in
her target language and a random text generated by a spider trap.
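To make that concrete: real running text in, say, English carries a
predictably high density of closed-class function words, while the random
filler a trap page emits usually doesn't. A few lines of Python are enough
to sketch the idea (the word list and the 0.2 threshold below are
illustrative guesses on my part, not calibrated values):

import re

# A handful of high-frequency English function words. Genuine English text
# should contain a healthy proportion of these; random filler usually won't.
FUNCTION_WORDS = {
    "the", "of", "and", "a", "in", "to", "is", "was", "it",
    "for", "on", "that", "with", "as", "be", "at", "by",
}

def looks_like_english(text, threshold=0.2):
    tokens = re.findall(r"[a-z']+", text.lower())
    if len(tokens) < 50:
        return False  # too short to judge either way
    hits = sum(1 for t in tokens if t in FUNCTION_WORDS)
    return hits / len(tokens) >= threshold

Anything the filter rejects simply never enters the corpus; the crawl loses
nothing but a few junk pages.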


> Incidentally, a "spider trap" query on google returns many more results
> about crawlers, robots.txt files etc. than about how to capture
> eight-legged arachnids... one good example of how one should be careful
> when using the web as a way to gather knowledge about the world...

I believe there's a huge difference between using the web as a way to
gather knowledge about the world (especially if this is being done
automatically) and using the web as a way to populate a corpus for
linguistic research. The latter use is much less ambitious, and simply
doesn't need to be weighed down by most of the concerns that burden
web-mining or indexing applications.

Most corpus linguists who are constructing a dataset on the fly are just
interested in being able to track their samples back to the underlying
population, and are usually willing to add or change samples indefinitely
until their corpus has the characteristics they need. If web-served HTML
and plaintext are adequate to support their research questions, then a
simple web crawler will work just fine.
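And "simple" really can mean simple. Here is a minimal sketch of such a
crawler in Python, standard library only. The page and depth limits are
arbitrary illustrative choices, and they double as a crude spider-trap
guard: the crawl stops expanding long before an endless link maze matters.

import urllib.request
import urllib.robotparser
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkParser(HTMLParser):
    """Collects href values from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

_ROBOTS = {}  # one parsed robots.txt per host

def allowed(url):
    host = "{0.scheme}://{0.netloc}".format(urlparse(url))
    if host not in _ROBOTS:
        rp = urllib.robotparser.RobotFileParser(host + "/robots.txt")
        try:
            rp.read()
        except OSError:
            rp = None  # robots.txt unreachable: be permissive in this sketch
        _ROBOTS[host] = rp
    rp = _ROBOTS[host]
    return rp is None or rp.can_fetch("*", url)

def crawl(seed, max_pages=200, max_depth=3):
    seen = {seed}
    pages = []
    queue = deque([(seed, 0)])   # breadth-first over (url, depth)
    while queue and len(pages) < max_pages:
        url, depth = queue.popleft()
        if not allowed(url):
            continue
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                if "text/html" not in resp.headers.get("Content-Type", ""):
                    continue
                html = resp.read().decode("utf-8", errors="replace")
        except OSError:
            continue  # dead link, timeout, HTTP error: skip and move on
        pages.append((url, html))
        if depth < max_depth:
            parser = LinkParser()
            parser.feed(html)
            for href in parser.links:
                nxt = urljoin(url, href).split("#")[0]
                if nxt.startswith("http") and nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, depth + 1))
    return pages

Pair it with a language filter like the one sketched above and you have the
whole pipeline: fetch, check robots.txt, keep the pages that look like
target-language text, and follow links until the sample is big enough.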


-- Mark

Mark P. Line
Polymathix
San Antonio, TX


