[Corpora-List] WebCorp counts

j_kurjian at hotmail.com
Thu Apr 28 20:50:24 UTC 2005


Thanks; yes, that helps.  I now know what the upper cutoff is - and that's
fine.  As I said, the limit makes things more manageable.

Regards,
Jerry

>
>Dear Jerry Kurjian
>Apologies for the difficulties you are having with WebCorp-generated
>counts, but they are only temporary, we promise. A new version of WebCorp,
>to be released soon, will incorporate our own purpose-built search engine,
>and thus be able to offer accurate frequency counts, type/token ratios,
>collocational profiles and other statistics.
>
>To explain the problem you have had:
>at the moment WebCorp takes the first 200 hits for your search term from
>your chosen search engine (Google by default) and extracts concordances
>from those pages. Unless you choose the 'one concordance line per site'
>option, there is no limit on the number of concordance lines extracted from
>each of these 200 pages.
>
>However, you will sometimes get fewer than 200 concordance lines in the
>WebCorp output for your search term. This happens if you have chosen
>additional filtering options (which will filter out some of the 200 hits
>from Google), or if certain pages are not accessible when WebCorp tries to
>access them or have changed since they were indexed by Google and no longer
>contain your search term.
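
A minimal sketch of the procedure described above, not WebCorp's actual code. It assumes the pages have already been fetched into (url, text) pairs in search-engine rank order, with text set to None for a page that could not be accessed. The names kwic_lines and concordance, the 30-character context width, and the reading of 'one concordance line per site' as "the first line from the first matching page of each host" are assumptions made for illustration.

```python
import re
from urllib.parse import urlparse


def kwic_lines(text, phrase, width=30):
    """Return one keyword-in-context line per occurrence of phrase in text."""
    lines = []
    for m in re.finditer(re.escape(phrase), text, flags=re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        lines.append(f"{left}[{m.group(0)}]{right}")
    return lines


def concordance(pages, phrase, max_pages=200, one_per_site=False):
    """Collect KWIC lines from the first max_pages search results.

    pages: list of (url, text) pairs; text is None if the page was
    inaccessible (hypothetical input format).
    """
    out = []
    seen_sites = set()
    for url, text in pages[:max_pages]:
        if text is None:
            continue  # page not accessible at fetch time
        hits = kwic_lines(text, phrase)
        if not hits:
            continue  # page changed since it was indexed
        if one_per_site:
            site = urlparse(url).netloc
            if site in seen_sites:
                continue
            seen_sites.add(site)
            hits = hits[:1]
        out.extend(hits)
    return out
```

Under these assumptions the final count depends on how many of the (at most) 200 pages survive and how often each one contains the phrase, so it can fall below or rise above 200 and bears no fixed relation to the search engine's own hit figure.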
>
>Statistics extracted from the Web are inherently unreliable. AltaVista no
>longer returns word counts, and the number of 'hits' returned by Google is
>the number of pages containing your search term, not the number
>of occurrences of your search term on the Web.  Problems with Google counts
>were discussed recently on this list:
>http://torvald.aksis.uib.no/corpora/2005-1/0191.html
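
To illustrate the difference between a page count and an occurrence count, here is a small sketch over an in-memory list of texts. The function names page_hits and occurrences and the toy documents are hypothetical; this says nothing about how Google computes its estimates.

```python
def page_hits(docs, phrase):
    """Number of documents that contain the phrase at least once."""
    p = phrase.lower()
    return sum(1 for d in docs if p in d.lower())


def occurrences(docs, phrase):
    """Total number of (non-overlapping) occurrences across all documents."""
    p = phrase.lower()
    return sum(d.lower().count(p) for d in docs)


docs = [
    "I suggest you don't wait; really, I suggest you don't.",
    "We suggest you don't reply.",
    "No match on this page.",
]
print(page_hits(docs, "suggest you don't"))    # pages containing the phrase
print(occurrences(docs, "suggest you don't"))  # total occurrences
```

A page that repeats the phrase adds one to the first figure but several to the second, which is why the two cannot be used interchangeably as frequencies.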
>
>Hope this helps.
>Andrew Kehoe and Antoinette Renouf
>
>-----------------------------------------
>Research and Development Unit for English Studies
>School of English
>University of Central England, Birmingham
>http://rdues.uce.ac.uk/
>http://www.webcorp.org.uk/
>
>-----Original Message-----
>From: owner-corpora at lists.uib.no [mailto:owner-corpora at lists.uib.no] On
>Behalf Of j_kurjian at hotmail.com
>Sent: 23 April 2005 17:02
>To: corpora at uib.no
>Subject: [Corpora-List] WebCorp counts
>
>Hi all,
>I have a question about the concordance counts produced by the WebCorp
>site:
>
>http://www.webcorp.org.uk/wcadvanced.html
>
>For example, if I search "suggest you don't" vs. "suggest that you
>don't" using WebCorp (via Google), I get, at the bottom of the page, a
>concordance count of 187 vs. 96 KWICs respectively. However, if I search
>the same two terms, in quotes, on Google, I get 34,200 vs. 16,200 hits.
>The ratios are similar though not the same.
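
For reference, the two ratios can be worked out from the figures quoted above. This is plain arithmetic on those four numbers; the helper name ratio is just for illustration.

```python
def ratio(a, b):
    """Ratio of the first count to the second."""
    return a / b


webcorp_ratio = ratio(187, 96)       # WebCorp concordance counts
google_ratio = ratio(34200, 16200)   # Google hit counts

print(round(webcorp_ratio, 2))  # 1.95
print(round(google_ratio, 2))   # 2.11
```

Both tools put the shorter phrase at roughly twice the longer one, but the figures count different things, so the agreement is only approximate.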
>
>Does anyone have insight into how WebCorp calculates/filters its
>concordances or why these two engines are so different in the number of
>hits they return?
>
>In fact, it is nice to have the more manageable number produced by
>WebCorp, and the external collocate counts it creates. But, for example,
>if I am interested in the frequency of "I" collocating with the two
>search terms based on WebCorp, I'd like to be clearer about how those
>two counts are derived.
>
>Jerry
>



