<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=utf-8">
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
<HTML>
<HEAD>
<META NAME="Generator" CONTENT="MS Exchange Server version 6.0.6556.0">
<TITLE>RE: [Corpora-List] WebCorp counts</TITLE>
</HEAD>
<BODY dir=ltr>
<DIV dir=ltr>Dear Jerry Kurjian<BR>Apologies for the difficulties you are having
with WebCorp-generated counts, but they are only temporary, we promise. A new
version of WebCorp, to be released soon, will incorporate our own
purpose-built search engine, and thus be able to offer accurate
frequency counts, type/token ratios, collocational profiles and other
statistics.</DIV>
<DIV><FONT size=2><FONT size=3></FONT></FONT> </DIV>
<DIV><FONT size=2><FONT size=3>To explain the problem you have had:
</FONT></FONT></DIV>
<DIV><FONT size=2><FONT size=3>At the moment WebCorp takes the first 200 hits
for your search term from your chosen search engine (Google by default) and
extracts concordances from those pages. Unless you choose the 'one
concordance line per site' option, there is no limit on the number of
concordance lines extracted from each of these 200 pages.</FONT></FONT></DIV>
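<DIV><FONT size=2><FONT size=3>The procedure just described can be sketched roughly as follows. This is only an illustrative Python sketch, not WebCorp's actual code: it assumes the pages have already been fetched as (url, text) pairs in search-engine order, and the function name kwic_lines, the width parameter, and the reading of 'site' as the URL's host name are all assumptions made for the example.</FONT></FONT></DIV>
<PRE>
```python
import re
from urllib.parse import urlparse


def kwic_lines(pages, term, width=30, one_per_site=False, max_pages=200):
    """Extract concordance (KWIC) lines for term from fetched pages.

    pages is a list of (url, text) pairs in search-engine rank order.
    All names here are hypothetical; this is not WebCorp's implementation.
    """
    pattern = re.compile(re.escape(term), re.IGNORECASE)
    seen_sites = set()
    lines = []
    # Only the first max_pages hits are considered (200 in the description).
    for url, text in pages[:max_pages]:
        site = urlparse(url).netloc  # assumption: 'site' means host name
        for match in pattern.finditer(text):
            # With the option on, keep at most one line per site.
            if one_per_site and site in seen_sites:
                break
            left = text[max(0, match.start() - width):match.start()]
            right = text[match.end():match.end() + width]
            lines.append((url, left, match.group(0), right))
            seen_sites.add(site)
    return lines
```
</PRE>
<DIV><FONT size=2><FONT size=3>Without the one-per-site option, a single page that repeats the search term many times contributes many lines, which is why the number of concordance lines is not tied to the number of pages.</FONT></FONT></DIV>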
<DIV><FONT size=2><FONT size=3><BR>However, you will sometimes get fewer than
200 concordance lines in the WebCorp output for your search term. This happens
if you have chosen additional filtering options (which will filter out some of
the 200 hits from Google), or if certain pages are not accessible when
WebCorp tries to access them or have changed since they were indexed by Google
and no longer contain your search term.</FONT></FONT></DIV><FONT size=2><FONT
size=3>
<DIV><BR></FONT></FONT><FONT size=2><FONT size=3>Statistics extracted from the
Web are inherently unreliable. AltaVista no longer returns word counts, and the
number of 'hits' returned by Google is the number of pages containing your
search term, not the number of occurrences of your search term on the
Web. Problems with Google counts were discussed recently on this list:
</FONT><A href="http://torvald.aksis.uib.no/corpora/2005-1/0191.html"><FONT
size=3>http://torvald.aksis.uib.no/corpora/2005-1/0191.html</FONT></A><FONT
size=3>.</FONT><BR><FONT size=3></FONT></FONT></DIV>
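<DIV><FONT size=2><FONT size=3>To make the distinction between pages and occurrences concrete, here is a small Python sketch over three made-up page texts. The function name page_and_occurrence_counts and the sample texts are hypothetical, and the simple case-insensitive substring count is an assumption chosen for brevity, not a description of how any search engine counts.</FONT></FONT></DIV>
<PRE>
```python
def page_and_occurrence_counts(texts, term):
    """Return (pages containing term, total occurrences of term)."""
    needle = term.lower()
    # Non-overlapping, case-insensitive substring matches on each page.
    per_page = [text.lower().count(needle) for text in texts]
    pages_with_term = sum(1 for n in per_page if n)
    occurrences = sum(per_page)
    return pages_with_term, occurrences


texts = ["corpus corpus corpus", "a corpus", "none"]
print(page_and_occurrence_counts(texts, "corpus"))  # (2, 4)
```
</PRE>
<DIV><FONT size=2><FONT size=3>A 'hits' figure corresponds to the first number only, so it cannot be read as a word frequency.</FONT></FONT></DIV>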
<DIV><FONT size=2><FONT size=3>Hope this helps.</FONT></FONT></DIV>
<DIV>Andrew Kehoe and Antoinette Renouf
</DIV>
<DIV> </DIV>
<DIV><FONT
size=2><FONT>-----------------------------------------<BR></FONT><FONT
size=3>Research and Development Unit for English Studies<BR>School of
English<BR>University of Central England, Birmingham<BR></FONT><A
href="http://rdues.uce.ac.uk/"><FONT
size=3>http://rdues.uce.ac.uk/</FONT></A></FONT></DIV>
<P><FONT size=2><BR><BR><A
href="http://www.webcorp.org.uk/">http://www.webcorp.org.uk/</A><BR>-----Original
Message-----<BR>From: owner-corpora@lists.uib.no [<A
href="mailto:owner-corpora@lists.uib.no">mailto:owner-corpora@lists.uib.no</A>]
On<BR>Behalf Of j_kurjian@hotmail.com<BR>Sent: 23 April 2005 17:02<BR>To:
corpora@uib.no<BR>Subject: [Corpora-List] WebCorp counts<BR><BR>Hi all,<BR>I
have a question about the concordance counts produced by the
WebCorp<BR>site:<BR><BR><A
href="http://www.webcorp.org.uk/wcadvanced.html">http://www.webcorp.org.uk/wcadvanced.html</A><BR><BR>For
example, if I search ''suggest you don't'' vs. ''suggest that you<BR>don't''
using WebCorp (via Google) I get, at the bottom of the page, a<BR>concordance
count of 187 vs. 96 kwics respectively. However, if I search<BR>the same two
terms, in quotes, on Google, I get 34,200 vs. 16,200 hits.<BR>The ratios are
similar though not the same.<BR><BR>Does anyone have insight into how WebCorp
calculates/filters its<BR>concordances or why these two engines are so different
in the number of<BR>hits they return?<BR><BR>In fact, it is nice to have the
more manageable number produced by<BR>WebCorp,<BR>and the external collocate
counts it creates. But, for example, if I am<BR>interested in<BR>the frequency
of ''I'' collocating with the two search terms based on<BR>WebCorp, I'd like to
be clearer how those two counts are
derived.<BR><BR>Jerry<BR><BR></P></FONT>
</BODY>
</HTML>