Corpora: What is a corpus
Mills, Carl (MILLSCR)
MILLSCR at UCMAIL.UC.EDU
Fri Jan 28 15:26:30 UTC 2000
While I agree with Susan that
>One of the real joys of working with corpora is the excitement of finding
>something you weren't looking for. The more the input to the corpus is
>filtered by the preconceptions of the researchers, the less likelihood that
>these unexpected insights will arise
I have some difficulty with
>A corpus is a collection of texts, not a list of phrases, verb forms, or
>other fragments.
Similarly, Oliver's point
>The main point I wanted to make was that I understand a
>corpus to be a lump of real language, not extracts of the same. So you
>could have a corpus of almost anything that is a text type or genre,
>but it wouldn't be a corpus any more once you meddle with it, by eg
>extracting all proverbs, noun phrases or whatnot.
[snip]
>By what I rather unprecisely called `filtering' I meant this extraction
>of elements from a corpus, not the creation of a corpus from the
>infinite amount of language data by selecting a sample of it.
raises some interesting, related issues.
As a statistical linguist, a sociolinguist, I would argue that there may be
reasons--due to limits of technology, researcher time, or whatever--why one
might want to create a corpus (or would it be a quasi-corpus? Or a
corpoid?) by properly designed random sampling from a text or body of
texts. The result would be a collection of "fragments," in Susan's terms,
"extracts" from Oliver's "lump of real language." However, I would argue
not only that such procedures are methodologically justifiable, but also
that the results would constitute a corpus. I think such a collection of samples
could be called a corpus because the sampling would not be defined in
biased, a priori units like "sentence," "proverb," or whatever.
Oliver's remarks would seem to allow such corpora. I am not sure Susan's
would.
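
To make the sampling idea concrete, here is a minimal sketch of one way it
might be done. It is an illustration only, not a procedure anyone in this
thread has proposed: the function name sample_windows, the whitespace
tokenization, and the fixed window size are all assumptions. The text is cut
into consecutive windows of equal token length, so the boundaries fall
wherever the count puts them rather than at "sentence" or "proverb" edges,
and a simple random sample of those windows is drawn.

```python
import random


def sample_windows(tokens, window_size, n_samples, seed=None):
    """Return a simple random sample of fixed-length token windows.

    Hypothetical helper for illustration; the window boundaries are set by
    token count alone, not by any linguistic unit.
    """
    # Cut the token stream into consecutive, non-overlapping windows.
    # A trailing remainder shorter than window_size is dropped.
    windows = [
        tokens[i:i + window_size]
        for i in range(0, len(tokens) - window_size + 1, window_size)
    ]
    if n_samples > len(windows):
        raise ValueError("n_samples exceeds the number of available windows")
    rng = random.Random(seed)
    # Sampling without replacement: every window is equally likely to be drawn.
    return rng.sample(windows, n_samples)


# Assumption: naive whitespace tokenization stands in for a real tokenizer.
text = "the quick brown fox jumps over the lazy dog and then runs away again"
tokens = text.split()
sample = sample_windows(tokens, window_size=3, n_samples=2, seed=0)
```

Each item in the result is a "fragment" in Susan's sense, but which fragments
end up in the collection is decided by the random draw, not by the
researcher's categories.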
Carl