The Internet Corpora

Emily Bender bender at csli.stanford.edu
Tue Apr 3 23:55:39 UTC 2001


There are tools available for using the web as a corpus.
Here is one:  http://www.webcorp.org.uk/

It is said to be particularly useful for finding examples
of low-frequency items.

I imagine that using the web as a corpus would be about the
same as using any other collection of texts as a corpus, in
that one has to be sure to keep in mind the source of the
data and various possible "impurities", such as inclusion of
diverse language varieties (this could be good or bad), degree
of editing (again good or bad, considering both typos and
lack of spontaneity), and lack of representativeness (what
genres are really included on the web, and in what proportions?).

Emily

James A. Crippen wrote:
>
> Could the World Wide Web and all the text in various languages available
> on it be considered legitimate forms of corpora?  Granted there are a
> large number of spelling mistakes (try searching for 'trasnlation' or
> 'lingiust'), and a large number of obvious grammatical errors (search for
> 'if he love me' for example), but the extant text available online
> certainly exceeds the size of any corpus for a major language by orders of
> magnitude.
>
> In the future, will we start seeing explicit WWW search results in
> papers?  This could easily become a major point of argument...
>
> 'james
>
> --
> James A. Crippen <james at unlambda.com> ,-./-.  Anchorage, Alaska,
> Lambda Unlimited: Recursion 'R' Us   |  |/  | USA, 61.2069 N, 149.766 W,
> Y = \f.(\x.f(xx)) (\x.f(xx))         |  |\  | Earth, Sol System,
> Y(F) = F(Y(F))                        \_,-_/  Milky Way.
>
>



More information about the HPSG-L mailing list