Corpora: What is a corpus

Fri Jan 28 15:26:30 UTC 2000

While I agree with Susan that

>One of the real joys of working with corpora is the excitement of finding
>something you weren't looking for.  The more the input to the corpus is
>filtered by the preconceptions of the researchers, the less likelihood that
>these unexpected insights will arise

I have some difficulty with

>A corpus is a collection of texts, not a list of phrases, verb forms, or
>other fragments.

Similarly, Oliver's point

	>The main point I wanted to make was that I understand a
	>corpus to be a lump of real language, not extracts of the same.  So you
	>could have a corpus of almost anything that is a text type or genre,
	>but it wouldn't be a corpus any more once you meddle with it, by eg
	>extracting all proverbs, noun phrases or whatnot.
[snip]
	>By what I rather unprecisely called `filtering' I meant this extraction
	>of elements from a corpus, not the creation of a corpus from the
	>infinite amount of language data by selecting a sample of it.

raises some interesting, related issues.

As a statistical linguist, a sociolinguist, I would argue that there may be
occasions (because of limits of technology, researcher time, or whatever)
when one might want to create a corpus (Or would it be a quasi-corpus?  Or a
corpoid?) by properly designed random sampling from a text or body of
texts.  The result would be a collection of "fragments," in Susan's terms,
or "extracts" from Oliver's "lump of real language."  However, I would argue
not only that such procedures are methodologically justifiable, but also
that the results would constitute a corpus.  I think such a collection of
samples could be called a corpus because the sampling units would not be
biased, a priori categories like "sentence," "proverb," or whatever.
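
To make the idea concrete, here is a minimal sketch in Python of one way
such sampling might be done.  It is only an illustration under my own
assumptions: the function name sample_windows, the fixed window size, and
the whitespace tokenization in the usage note are all hypothetical choices,
not an established procedure.  The text is cut into consecutive blocks of a
fixed number of tokens, and blocks are drawn at random without replacement,
so the sampling unit is an arbitrary stretch of running text rather than a
"sentence" or "proverb."

```python
import random


def sample_windows(tokens, window_size, n_samples, seed=None):
    """Return n_samples randomly chosen, non-overlapping token windows.

    The text is divided into consecutive blocks of window_size tokens
    (a trailing partial block is ignored), and blocks are drawn at
    random without replacement.  Block boundaries depend only on token
    position, not on any linguistic category.
    """
    rng = random.Random(seed)  # seeded so that a sample can be reproduced
    starts = list(range(0, len(tokens) - window_size + 1, window_size))
    if n_samples > len(starts):
        raise ValueError("text too short for the requested number of windows")
    chosen = sorted(rng.sample(starts, n_samples))  # keep original text order
    return [tokens[s:s + window_size] for s in chosen]
```

With a text split on whitespace, for instance tokens = text.split(), a call
such as sample_windows(tokens, 200, 50, seed=1) would give fifty 200-token
stretches.  Whether 200 tokens is a sensible window, and whether sampling
should be stratified by text, are design questions the sketch leaves open.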

Oliver's remarks would seem to allow such corpora.  I am not sure Susan's
would.

Carl