[Corpora-List] A new, faster WebCorp

Andrew Kehoe Andrew.Kehoe at bcu.ac.uk
Fri Oct 31 15:27:52 UTC 2008


Dear Colleagues

We are pleased to announce the release of a faster and more reliable version of the WebCorp web concordancer at http://www.webcorp.org.uk.  As before, the advanced search options can be found at http://www.webcorp.org.uk/wcadvanced.html and there is a user guide at http://www.webcorp.org.uk/guide.

The release of this new version marks the 10th anniversary of WebCorp.  The prototype version was launched in 1998 to test the hypothesis that the web could be used as a source of linguistic data and to provide a simple mechanism for doing so. As the prototype grew in complexity and popularity, searches became slow and unreliable at times.  In addition to being faster, the new version contains a number of other enhancements, which are listed at http://www.webcorp.org.uk/guide/changes.htm.

This 'live' version of WebCorp runs 'on top of' commercial search engines, extracting concordances from the web in real time.  For the past two years we have also been working on the WebCorp Linguist's Search Engine (WebCorpLSE), our own large-scale search engine. WebCorpLSE is crawling and processing the web to build a 10-billion-word (or 7-terabyte) text corpus, including a multi-terabyte 'mini-web' designed to act as a microcosm of the web itself. In addition, WebCorpLSE includes a newspaper sub-corpus, and we have worked with colleagues to build collections to assist in their research, including sub-corpora of blogs, science fiction, Charles Dickens, Thomas Carlyle, James Joyce, and Restoration Drama. One particular success has been the annotated Anglo-Norman Correspondence Corpus, built in collaboration with Dr Richard Ingham.

The mini-web and all sub-corpora are searchable via linguistically tailored front-ends, which allow kinds of search not possible in the original WebCorp system (full pattern matching and wildcard search, grammatical search, statistical collocation and other analyses, etc.). See http://www.webcorp.org.uk/webcorp_linguistic_search_engine.html for details.
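
For readers unfamiliar with the idea, a wildcard concordance search can be sketched roughly as follows in Python. This is only an illustration and not how WebCorp or WebCorpLSE is implemented: the function name `kwic`, the use of `*` as a wildcard for any run of word characters, the context width and the sample text are all assumptions made for the example.

```python
import re

def kwic(text, pattern, width=30):
    """Return keyword-in-context lines for a word pattern.

    Hypothetical convention for this sketch: '*' in the pattern
    matches any run of word characters (possibly empty).
    """
    # Escape the pattern, then turn the escaped '*' into a word-character wildcard.
    body = re.escape(pattern).replace(r"\*", r"\w*")
    regex = re.compile(r"\b" + body + r"\b", re.IGNORECASE)

    lines = []
    for m in regex.finditer(text):
        # Take up to `width` characters of context on each side of the match.
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        # Right-align the left context so the keywords line up in a column.
        lines.append(f"{left:>{width}} [{m.group()}] {right}")
    return lines

# Sample text invented for the example.
sample = ("The web is a corpus. WebCorp extracts concordances from the web; "
          "a concordancer shows words in context.")

for line in kwic(sample, "concord*"):
    print(line)
```

A search for `concord*` over the sample text matches both "concordances" and "concordancer", while a search for `web` matches the standalone word but not "WebCorp", because the pattern is anchored at word boundaries.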

Best wishes

Andrew Kehoe
Research and Development Unit for English Studies
School of English
Birmingham City University
http://rdues.bcu.ac.uk/
 
http://www.webcorp.org.uk/





_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora


