Corpora

Thu Dec 16 18:19:23 UTC 1999

Evidently Joan wanted to make the valid point that Stanford is not just
MIT West, but then it degenerated into "my department has more corpus
linguists than your department" (unless you're Penn) (and I'm having a
little trouble remembering who they are).

More to the point might be a discussion of what corpus research involves,
what kinds of corpora there are, how they can best be exploited, etc.
Brian and others are providing an important service in this regard with
the TalkBank project.

As I see it, we're talking about an alternative to the popular kinds of
data-gathering that have involved inventing sentences in isolation and
asking people whether they "get" them, or measuring the reaction times of
college sophomores who see them written on computer screens.  The
alternative is to examine how people actually use language, a process that
necessarily involves confronting more than single sentences.  In that
sense it's part of what has come to be called the study of "discourse",
which of course can be conducted in many different ways.  It's worth
noting that some people have been examining such data for a long time:
one thinks, for example, of many language acquisition studies, of the
analysis of conversation (from various points of view), and of the
recording and analysis of "texts" collected by those who have been
studying lesser-known languages.  This last kind of corpus work has been
going on for well over a hundred years.

It's worth noting, too, that this distinction between constructed language
and what I like to call "real" language ("natural language" has been
coopted with a different meaning) is orthogonal to the
formalist-functionalist dichotomy, at least in the sense that while many
functionalists do work with corpora, many do not.

It might be worth discussing the problems that arise from the supposedly
accidental nature of corpora, and the lack of the control and
replicability that are so dear to the hearts of psychologists.  One might
actually find some significance, for example, in the fact that people
rarely use a construction one might think easy to invent.  And of course
the problem tends to diminish with very large corpora.

But very large corpora may introduce a problem of their own.  Some of you
may remember Zellig Harris's book Methods in Structural Linguistics, where
he suggested we could get around the vexing problem of meaning by
examining the distribution of linguistic forms in very large corpora.
Machines to do that weren't available at the time (1951), but now they
are, and it looks to me as if some people are doing what Harris had in
mind, though so far as I know they don't refer to him.  It makes me
uncomfortable because I think it's more rewarding in the long run to
confront semantics head-on, not trying to avoid it with big corpora and
machines.
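
For readers who have not seen this kind of distributional work, here is a
minimal sketch in Python of the general idea: record the environments each
form occurs in, then compare forms by the environments they share.  It is
not Harris's own procedure, and the function names (contexts,
shared_contexts) and the toy corpus are invented purely for illustration.

```python
from collections import defaultdict

def contexts(tokens):
    """Map each word to the set of (left, right) neighbor pairs it occurs between."""
    env = defaultdict(set)
    # Pad with boundary markers so the first and last tokens also get a context.
    padded = ["<s>"] + tokens + ["</s>"]
    for i in range(1, len(padded) - 1):
        env[padded[i]].add((padded[i - 1], padded[i + 1]))
    return env

def shared_contexts(env, a, b):
    """Return the environments in which both a and b have been observed."""
    return env[a] & env[b]

# Hypothetical toy corpus; a real study would use a very large one.
toy = "the dog barked . the cat barked . the dog slept . the cat slept .".split()
env = contexts(toy)

print(shared_contexts(env, "dog", "cat"))     # two shared environments
print(shared_contexts(env, "dog", "barked"))  # no shared environments
```

On this toy data "dog" and "cat" turn up in the same environments and
"dog" and "barked" do not, which groups the forms without any appeal to
what they mean.  That is exactly the feature at issue here.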

Just one last reservation.  Corpora make it easy to count things and come
up with interesting findings regarding the frequency of this or that.  But
knowing exactly what you're counting may not be such a simple matter, and
it's easy to come up with "operational definitions" that turn out in the
end to be spurious.  What I'm trying to say is that there's much of
importance to learn from examining real language, but it shouldn't seduce
us into thinking we can just crank out analyses mechanically.
Understanding the nature of language is always going to require the
intervention of perceptive human minds.
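
A small illustration of the counting problem, as a sketch only: the same
sentence fragment yields different frequencies for "don't" depending on
how one decides to define a word.  The sample text and both definitions
below are assumptions made up for the example.

```python
import re
from collections import Counter

# Hypothetical sample text.
text = "I don't know. Don't you? I do not."

# Definition A: a word is whatever stands between spaces, case-sensitive.
count_a = Counter(text.split())

# Definition B: lowercase everything; a word is a run of letters or apostrophes.
count_b = Counter(re.findall(r"[a-z']+", text.lower()))

print(count_a["don't"], count_b["don't"])  # prints: 1 2
```

Neither count settles whether "do not" should be tallied with "don't";
that decision is an analytic one the machine cannot make for us.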

Wally Chafe