Corpora: The Web as a corpus

Tue May 2 14:15:44 UTC 2000

The WebCorp Project

Research and Development for English Studies
University of Liverpool
U.K.

Dear corpus linguists,

However large and up-to-date the available electronic text corpora
are, there will always be aspects of the language which are too rare
or too new to be evidenced in them. For some time, this Unit has
therefore been developing an Internet search tool which allows on-line
access to Web texts as linguistic rather than information sources.

The prototype version of the tool can be tested at:
http://webcorp.connect.org.uk/

The user submits a word or phrase for which instances of usage are
required. The tool passes the search term to a web search engine of
the user's choice, then visits each of the web sites returned by the
search engine and automatically extracts concordance lines from them.
The search can currently be customised in terms of contextual span,
case sensitivity and output format, with further options under
development.
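
For readers unfamiliar with concordancing, the extraction step can be
illustrated with a minimal sketch in Python. This is not the WebCorp
code: the function names (concordance, format_line), the span measured
in characters, and the whole-word matching rule are all assumptions
made for the example, and the text is taken as already downloaded and
stripped of markup.

```python
import re

def concordance(text, term, span=40, case_sensitive=False):
    """Return (left, keyword, right) tuples for each match of `term`.

    Hypothetical helper: `span` is the number of characters of context
    kept on each side, which is only one possible reading of
    "contextual span".
    """
    # Collapse runs of whitespace so every hit fits on one line.
    text = " ".join(text.split())
    flags = 0 if case_sensitive else re.IGNORECASE
    # Assumes the term starts and ends with a word character,
    # so that the \b word boundaries behave as expected.
    pattern = r"\b" + re.escape(" ".join(term.split())) + r"\b"
    hits = []
    for m in re.finditer(pattern, text, flags):
        left = text[max(0, m.start() - span):m.start()]
        right = text[m.end():m.end() + span]
        hits.append((left, m.group(0), right))
    return hits

def format_line(hit, span=40):
    """Align the keyword in a fixed column, KWIC style."""
    left, keyword, right = hit
    return f"{left.rjust(span)}[{keyword}]{right}"

if __name__ == "__main__":
    sample = "The Web is big. The web as a corpus is new. WEB texts vary."
    for hit in concordance(sample, "web", span=15):
        print(format_line(hit, span=15))
```

Setting case_sensitive=True restricts the hits to the exact
capitalisation typed, and changing span widens or narrows the context
shown around each hit.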

The user is not required to specify particular web sites to be
searched. Instead, the tool searches all sites on the web which are
accessible via the chosen search engine. One of the search engine
options available is Metacrawler, which itself searches other search
engines, maximising coverage and automatically removing duplicate
results.
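
The idea of combining results from several search engines and removing
duplicates can be sketched as follows. This is an illustration only,
not a description of how Metacrawler or the WebCorp tool works
internally: the function names and the normalisation rule (lower-case
scheme and host, no fragment, no trailing slash) are assumptions, and
the URLs are placeholders.

```python
from urllib.parse import urlsplit, urlunsplit

def normalise(url):
    """Reduce a URL to a comparison key (assumed rule, see lead-in)."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/") or "/"
    # Drop the fragment; keep the query string unchanged.
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, parts.query, "")
    )

def merge_results(*result_lists):
    """Merge URL lists, keeping the first occurrence of each page."""
    seen = set()
    merged = []
    for results in result_lists:
        for url in results:
            key = normalise(url)
            if key not in seen:
                seen.add(key)
                merged.append(url)
    return merged

if __name__ == "__main__":
    engine_a = ["http://example.org/a", "http://example.org/b/"]
    engine_b = ["http://EXAMPLE.org/b", "http://example.org/c#top"]
    for url in merge_results(engine_a, engine_b):
        print(url)
```

Each page is then visited once, however many engines returned it.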

The tool is available for trial and you are kindly requested to
provide feedback on your experience and needs, which will be taken
into account in ongoing development.

Andrew Kehoe
RDUES
andrew at rdues.liv.ac.uk