The ANC (was: Using the BNC)

Thu Dec 19 09:30:40 UTC 2002

James A. Landau wrote:

> Can't the World Wide Web be considered as a corpus?  Of the
> English-language Web pages, most are contemporary, most are
> colloquial, and most are unedited. There exist search tools (such
> as the Google search engine that has been discussed on this
> thread).

It's an invaluable research tool for certain purposes.

But, as I mentioned earlier, the text on the Web is full of systemic
biases that make using it for corpus purposes a risky business. It's
impossible, for example, to restrict a search to a specific national
variety of English (or even to text produced by native English
speakers), to a particular register, or even to a given period
(there's a surprisingly, and pleasantly, large amount of historical
material online, even outside the formal e-text collections such as
Project Gutenberg).

The tools to research it in the same way as one does a corpus don't
exist; I'm not at all sure how easy it would be to create them for
such an inchoate mass of material.

Having said that, I have used Web searches, for example to get a
feel for the relative frequency of terms, though even here the
presence of so many glossaries and Weird Words sites (guilty as
charged, M'Lud) biases the results substantially for rare items.

--
Michael Quinion
Editor, World Wide Words
E-mail: <TheEditor at worldwidewords.org>
Web: <http://www.worldwidewords.org/>