FW: The ANC (was: Using the BNC)

Thu Dec 19 13:27:59 UTC 2002

Michael Q replied to James L's query as shown below.

I would merely add that by "tools to research it" Michael means, in part,
software that will display related "hits" in concordance format, sortable to
the left or to the right of the key word/phrase, and sortable to a
specifiable range of letters or words.  Also, proximity searches are
essential.
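
To make the features listed above concrete, here is a minimal sketch in Python of a concordance (key word in context) display that can be sorted on the words to the left or right of the key word, plus a simple proximity search. It assumes plain text already held in memory and a deliberately crude tokenizer. The function names (tokenize, kwic, sort_hits, near, format_hit) are illustrative and not taken from any existing corpus tool.

```python
import re

def tokenize(text):
    # Crude assumption: a word is a run of letters or apostrophes.
    return re.findall(r"[A-Za-z']+", text)

def kwic(tokens, key, width=4):
    """Return (left context, key word, right context) for each hit."""
    key = key.lower()
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == key:
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            hits.append((left, tok, right))
    return hits

def sort_hits(hits, side="right", span=1):
    """Sort hits on the `span` words nearest the key word, left or right."""
    if side == "left":
        # Nearest left-hand word first, then the one before it, and so on.
        keyfn = lambda h: [w.lower() for w in reversed(h[0])][:span]
    else:
        keyfn = lambda h: [w.lower() for w in h[2]][:span]
    return sorted(hits, key=keyfn)

def near(tokens, a, b, distance=5):
    """Proximity search: positions where a and b fall within `distance` words."""
    a, b = a.lower(), b.lower()
    pos_a = [i for i, t in enumerate(tokens) if t.lower() == a]
    pos_b = [i for i, t in enumerate(tokens) if t.lower() == b]
    return [(i, j) for i in pos_a for j in pos_b
            if i != j and abs(i - j) <= distance]

def format_hit(hit, width=30):
    # Right-align the left context so the key words line up in a column.
    left, key, right = hit
    return f"{' '.join(left):>{width}}  {key}  {' '.join(right)}"

sample = ("The Web is a corpus of sorts. A balanced corpus is rare. "
          "The corpus grows daily.")
words = tokenize(sample)
for hit in sort_hits(kwic(words, "corpus", width=2), side="right"):
    print(format_hit(hit))
```

Doing this over Web data would also need some way to fetch and clean the pages, which the sketch leaves out entirely.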

If someone could write software that would do just THAT much (there are
other useful corpus-searching tools, too, but the ones mentioned would do,
for a start) with Web data, then yes, the Web would be a corpus, in the lexo
sense.  It would not be "balanced", but it would be a corpus, and MUCH more
useful than it is now for lexo research (lexical, too).

If you need to see what concordance format looks like, see the Intro to
either the New Oxford Dict of English (UK) or the New Oxford American Dict.
Many US libraries have the latter.

Oh, btw, even if the Web were a corpus, we would still need researchers such
as Fred Shapiro and Barry Popik, not to mention a bunch of lexos and
linguists.  And we would not throw out our citation files -- different tool,
for a different purpose.

Frank Abate

************************************

James A. Landau wrote:

> Can't the World Wide Web be considered as a corpus?  Of the
> English-language Web pages, most are contemporary, most are
> colloquial, and most are unedited. There exist search tools (such
> as the Google search engine that has been discussed on this
> thread).

It's an invaluable research tool for certain purposes.

But, as I mentioned earlier, the text on the Web is full of systemic
biases that make using it for corpus purposes a risky business. It's
impossible, for example, to restrict a search to a specific national
English form (or even text produced by native English speakers), to a
particular register, or even to a given period (there's a surprisingly,
albeit pleasantly, large amount of historical material online, even
outside the formal e-text collections such as Project Gutenberg).

The tools to research it in the same way as one does a corpus don't
exist; I'm not at all sure how easy it would be to create them for
such an inchoate mass of material.

Having said that, I've used Web searches - for example - to get a
feel for the relative frequency of occurrence of terms, though even
here the presence of so many glossaries and Weird Words sites (guilty
as charged, M'Lud) biases the results substantially for rare items.
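
The glossary effect can be shown with a rough sketch in Python. It assumes the pages are already in hand as plain strings, and it measures the share of pages that contain each term, with an optional filter to leave out glossary-like pages. The function doc_frequency and the looks_like_glossary test are hypothetical, and the three sample pages are invented for illustration.

```python
import re

def doc_frequency(docs, terms, skip=lambda doc: False):
    """Share of kept documents that contain each term at least once."""
    counts = {t: 0 for t in terms}
    kept = 0
    for doc in docs:
        if skip(doc):
            continue  # leave out pages the caller flags, e.g. glossaries
        kept += 1
        words = set(re.findall(r"[a-z']+", doc.lower()))
        for t in terms:
            if t.lower() in words:
                counts[t] += 1
    return {t: (counts[t] / kept if kept else 0.0) for t in terms}

# Invented sample pages, for illustration only.
pages = [
    "A glossary of weird words: absquatulate, widdershins.",
    "He walked widdershins round the church.",
    "Nothing odd here.",
]
terms = ["widdershins", "absquatulate"]

# Hypothetical filter: treat any page that mentions "glossary" as a glossary.
looks_like_glossary = lambda doc: "glossary" in doc.lower()

print(doc_frequency(pages, terms))
print(doc_frequency(pages, terms, skip=looks_like_glossary))
```

A real filter would need to be far subtler than a single keyword test, which is part of the difficulty.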

--
Michael Quinion
Editor, World Wide Words
E-mail: <TheEditor at worldwidewords.org>
Web: <http://www.worldwidewords.org/>