FW: [Corpora-List] The genre of the Web

Mark Davies Mark_Davies at byu.edu
Sun Sep 18 20:47:29 UTC 2005

As I mentioned in my original post, we all know that there is a bit of every register on the Web -- SPOKEN (transcripts of interviews, etc.), FICTION (repositories of literature), lots of NEWSPAPERS, ACADEMIC-oriented materials, and so on.  So there is no question about that -- the Web has a bit of everything.
The original question, though, was which genres/registers (of the BNC, for example) would have frequency data that would correspond *most closely* to reliable frequency data from the web -- i.e. for the Web *as a whole*?
In some very, very preliminary work that I've done, it appears that the frequency data from the web is *most* in line with the frequency data from either the newspaper or academic registers of the BNC, rather than spoken or fiction.  Again, not to say that there isn't a bit of everything, but it is *most similar* to the registers just mentioned.
Part of the reason that I asked the question in the first place has to do with pedagogical concerns.  Suppose that my students obtain frequency data from the web as well as frequency data from a spoken corpus.  My guess is that they will find a fair number of features (lexical, grammatical, etc.) that are relatively more common in the spoken corpus than on the Web, and vice versa.  My guess, though (based on very preliminary data), is that there would be less of a mismatch with newspaper- or academic-based corpora.
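To make the kind of comparison described above concrete, here is a minimal sketch in Python.  It is only an illustration: the function names, the floor value for unseen words, and the choice of mean absolute log ratio as the distance are all assumptions, not the method behind the preliminary results mentioned above.  Real use would need tokenized samples of the Web and of each BNC register in place of whatever token lists are passed in.

```python
from collections import Counter
import math


def per_million(tokens):
    """Map each word to its frequency per million tokens."""
    counts = Counter(tokens)
    total = len(tokens)
    return {word: count * 1_000_000 / total for word, count in counts.items()}


def mean_abs_log_ratio(freq_a, freq_b, floor=1.0):
    """Average |log2(a / b)| over the union of both vocabularies.

    Lower values mean the two frequency profiles are more alike.
    A word missing from one profile is given the rate `floor`
    (an assumed smoothing value) so the ratio stays defined.
    """
    vocab = set(freq_a) | set(freq_b)
    if not vocab:
        return 0.0
    total = 0.0
    for word in vocab:
        a = freq_a.get(word, floor)
        b = freq_b.get(word, floor)
        total += abs(math.log2(a / b))
    return total / len(vocab)


def closest_register(target, registers):
    """Return the name of the register whose profile is nearest to `target`,
    together with the distance computed for every register."""
    scores = {
        name: mean_abs_log_ratio(target, freqs)
        for name, freqs in registers.items()
    }
    return min(scores, key=scores.get), scores
```

Students could build one profile from web data and one per register with per_million(), then call closest_register() to see which register shows the smallest mismatch.  Other measures (rank correlation, chi-square, log-likelihood) would be equally reasonable choices here.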
From what I've gathered talking to others over the past year, the issue of what register(s) make up the Web is an ongoing and important question for some researchers.  I'd be interested in hearing from those people.
Mark Davies
Assoc. Prof., Linguistics
Brigham Young University
(phone) 801-422-9168 / (fax) 801-422-0906

** Corpus design and use // Linguistic databases **
** Historical linguistics // Language variation **
** English, Spanish, and Portuguese **


From: owner-corpora at lists.uib.no on behalf of John F. Sowa
Sent: Sun 9/18/2005 12:28 PM
To: Mark P. Line
Cc: corpora at uib.no
Subject: Re: [Corpora-List] The genre of the Web

I agree with Mark Line on that point:

 > I don't think of the Web as a genre at all.

On the other hand, it's not clear that the web
is a medium.

 > It's a very flexible medium, in fact, because
 > it seems to carry all genres effectively.

In that regard, it's more like a very dynamic
library.  But it is also as interactive as
telephones or video games (which it carries
as well).

And I certainly don't agree with Mark Davies on
that point:

 > most would probably agree that the web is more

That's probably what most people on Corpora list
would say.  But the people who make the most money
from the web are the gambling casinos and the
porno peddlers.

John Sowa
