The Web as a corpus

Thu Dec 19 16:30:53 UTC 2002

I found an answer to my (deliberately) naive query about using the Web as a
corpus.

From the latest OED News:

"We have for several decades used electronic databases to aid this work,
notably the British National Corpus and Lexis-Nexis. Now, with the arrival of
the Internet, tens of thousands of scholarly texts and individual works of
literature are available to us in a searchable form.

"However, bigger isn't necessarily better: we need to be discriminating. A
search engine, such as Google, provides a scattergun approach, returning a
vast set of results with no indication of the date or reliability of sources.
We are therefore most interested in material that has been collected together
in databases, where we are able to carry out sophisticated searches (by date,
in proximity to other terms, etc.) and where we can rely on the provenance of
the information we are viewing."

          - Jim Landau