[Corpora-List] WebCorp counts
Antoinette Renouf
Antoinette.Renouf at uce.ac.uk
Wed Apr 27 11:21:38 UTC 2005
Dear Jerry Kurjian
Apologies for the difficulties you are having with WebCorp-generated counts, but they are only temporary, we promise. A new version of WebCorp, to be released soon, will incorporate our own purpose-built search engine, and thus be able to offer accurate frequency counts, type/token ratios, collocational profiles and other statistics.
To explain the problem you have had:
At the moment, WebCorp takes the first 200 hits for your search term from your chosen search engine (Google by default) and extracts concordances from those pages. Unless you choose the 'one concordance line per site' option, there is no limit on the number of concordance lines extracted from each of these 200 pages.
However, you will sometimes get fewer than 200 concordance lines in the WebCorp output for your search term. This happens if you have chosen additional filtering options (which will filter out some of the 200 hits from Google), or if certain pages are not accessible when WebCorp tries to access them or have changed since they were indexed by Google and no longer contain your search term.
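The procedure described above can be sketched roughly as follows. This is only an illustration of the general idea, not WebCorp's actual code: the function names, the `(url, text)` input format, the 30-character context width and the use of the host name to identify a "site" are all assumptions made for the example.

```python
import re
from urllib.parse import urlparse


def kwic_lines(text, phrase, width=30):
    """Return one keyword-in-context line per occurrence of phrase in text."""
    lines = []
    for m in re.finditer(re.escape(phrase), text, flags=re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        # Right-align the left context so the search term lines up in a column.
        lines.append(f"{left:>{width}}[{m.group(0)}]{right}")
    return lines


def concordance(pages, phrase, max_pages=200, one_per_site=False):
    """Build a concordance from (url, text) pairs given in search-engine rank order.

    Hypothetical input format: text is the page content as fetched now,
    which may be empty if the page could not be retrieved.
    """
    seen_sites = set()
    output = []
    for url, text in pages[:max_pages]:
        hits = kwic_lines(text, phrase)
        if not hits:
            # Page unavailable, or changed since it was indexed.
            continue
        if one_per_site:
            site = urlparse(url).netloc
            if site in seen_sites:
                continue
            seen_sites.add(site)
            hits = hits[:1]
        output.extend(hits)
    return output
```

Under these assumptions, the number of lines returned depends on how many of the first 200 pages still contain the term and how often each page uses it, so it can be either lower or higher than 200.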
Statistics extracted from the Web are inherently unreliable. AltaVista no longer returns word counts, and the number of 'hits' returned by Google is the number of pages containing your search term, not the number of occurrences of your search term on the Web. Problems with Google counts were discussed recently on this list: http://torvald.aksis.uib.no/corpora/2005-1/0191.html
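To illustrate the difference between a page count and an occurrence count, here is a minimal sketch on a toy set of three invented "pages". The data and function names are made up for the example and have nothing to do with how Google computes its figures.

```python
def page_hits(pages, phrase):
    """Number of pages that contain the phrase at least once."""
    return sum(1 for text in pages if phrase in text)


def occurrences(pages, phrase):
    """Total number of (non-overlapping) occurrences across all pages."""
    return sum(text.count(phrase) for text in pages)


# Invented sample data, for illustration only.
toy_pages = [
    "I suggest you don't go, and I suggest you don't call.",
    "We suggest you don't reply.",
    "An unrelated page.",
]

print(page_hits(toy_pages, "suggest you don't"))    # 2
print(occurrences(toy_pages, "suggest you don't"))  # 3
```

A search engine's hit figure corresponds to the first number, whereas a frequency count in the corpus-linguistic sense corresponds to the second.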
Hope this helps.
Andrew Kehoe and Antoinette Renouf
-----------------------------------------
Research and Development Unit for English Studies
School of English
University of Central England, Birmingham
http://rdues.uce.ac.uk/
http://www.webcorp.org.uk/
-----Original Message-----
From: owner-corpora at lists.uib.no [mailto:owner-corpora at lists.uib.no] On Behalf Of j_kurjian at hotmail.com
Sent: 23 April 2005 17:02
To: corpora at uib.no
Subject: [Corpora-List] WebCorp counts
Hi all,
I have a question about the concordance counts produced by the WebCorp
site:
http://www.webcorp.org.uk/wcadvanced.html
For example, if I search "suggest you don't" vs. "suggest that you
don't" using WebCorp (via Google) I get, at the bottom of the page, a
concordance count of 187 vs. 96 KWICs respectively. However, if I search
the same two terms, in quotes, on Google, I get 34,200 vs. 16,200 hits.
The ratios are similar though not the same.
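For reference, the two ratios can be worked out from the figures quoted above. The short sketch below uses only those four numbers; the helper name `ratio` is just for this example.

```python
def ratio(a, b):
    """Ratio of the first count to the second."""
    return a / b


webcorp = ratio(187, 96)       # WebCorp concordance lines
google = ratio(34200, 16200)   # Google page hits

print(round(webcorp, 2))  # 1.95
print(round(google, 2))   # 2.11
```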
Does anyone have insight into how WebCorp calculates/filters its
concordances or why these two engines are so different in the number of
hits they return?
In fact, it is nice to have the more manageable number produced by
WebCorp, and the external collocate counts it creates. But, for example,
if I am interested in the frequency of "I" collocating with the two
search terms based on WebCorp, I'd like to be clearer how those two
counts are derived.
Jerry