Corpora: What is a corpus

Lou Burnard lou.burnard at computing-services.oxford.ac.uk
Fri Jan 28 10:38:27 UTC 2000


This turns out to be quite an interesting discussion, since it really
hinges on what a "proverb" is. If Francois had said (say) a corpus of
sermons, or a corpus of advertisements, or a corpus of texts composed
by 18th-century French expatriate seamen with wooden legs, I don't
think Oliver would have turned a hair (well, maybe in the last
example) because all of those things are definable as types of text or
artefact or entity or whatever. But proverbs don't seem to fit in with
that list of things somehow: where would you look for proverbs? They
don't typically appear in isolation -- you don't go to the book shop
and say "What proverbs have been published lately?" -- the newspapers
don't have lists of today's hot proverbs -- no-one ever says "I think
I'll create a proverb today" -- all of which makes me think that a
proverb is not a text, but a judgment about a bit of a text. A
collection of things-judged-proverbial is an interesting text,
certainly, but it doesn't seem to be a corpus as we currently think of
one.

So while I agree with Lucian (and everyone else) that it's the act of
filtering which defines a corpus, I feel the need to define the nature
of the holes in the filter a bit more precisely. In other words, I
think we need a definition for the *components* of a corpus, which
would accept (say) a classified advert or a conversation with a travel
agent but reject a metaphor or a proverb or even (here I feel the
ground a bit shaky) a sentence containing a past tense verb.

Lou

 ----------------------------------------------------------------
 Lou Burnard                           http://users.ox.ac.uk/~lou
 ----------------------------------------------------------------


