Corpora: What is a corpus

Lucian Galescu galescu at cs.rochester.edu
Fri Jan 28 00:48:41 UTC 2000


It strikes me as ironic that corpus linguists would want to prescribe
the usage of the word "corpus". Using Oliver's terminology, I would say
that all corpora are `filtered'. Choosing 13th-century texts, or
Shakespeare's plays, or conversations with a travel agent, or the Bible,
etc., etc., all are ways of filtering the abstract body of language
around us for a specific purpose, since they all involve a criterion of
what is in and what is out of the corpus.
So, if Francois' purpose is to study proverbs, he could just as well do
it using a corpus-based methodology (I'm not saying anything about
whether that is appropriate or not -- it all depends on what his actual
goals are). And if someone else wants to study the intra-sentential
behavior of past tense verbs, they might just as well collect a corpus
of past tense sentences. Btw, recently I have also heard of corpora of
images, which go even further from the original "collection of
texts" definition brought up by Paul Hays.

I would agree with Oliver when he says:

   My understanding of `corpus' is that it is some more or less
   homogeneous collection of utterances, but not `filtered'

if "homogeneous" meant that there is a criterion for selecting what is in
and what is out; and (in order not to make the above 'definition'
contradictory) "not `filtered'" meant that no further restriction should
be imposed on the data, beyond the mentioned selection criterion (as
Paul mentioned, this is sometimes hard to achieve).

Have a beautiful day!
 -Lucian Galescu


