Corpora: when does a subcorpus become a corpus

Fri Jan 4 12:59:34 UTC 2002

At 14:44 4.1.2002 +0300, P bI K O B_          B.B. wrote:
>I am afraid that my opinion is different.
>If I have any special corpus - Russian newspaper prose, Mexican proverbs
>or German political metaphors - then any results of mine based on these
>corpora would be true for the language of Russian newspapers, Mexican
>proverbs etc. ONLY.
>But my results - any observations on any speech phenomenon - based on a
>general, properly compiled corpus would be true for the language IN GENERAL.

Dear Vladimir, I see no difference between your opinion and mine, except
that I doubt that general corpora really exist. By 'general corpus' I mean
a corpus you could use for any linguistic purpose, that is, a corpus
supposed to be representative of a language in general.

It is clear that the more restricted or specialized your corpus is, the
less your results generalize to the language as a whole, because the
(usually imaginary) total populations are different: smaller and more
clear-cut for more specialized corpora. Without doubt, not everything that
is true for a corpus of Russian newspaper prose is true for the Russian
language as a whole, but the two still have something in common.
The advantage of a specialized corpus is that the (special) features you
are interested in are more evident there. However, consider that any
"smaller population" (i.e. a sample, or corpus) is a part of the "totality"
(the language), which itself cannot be captured by any means. Thus, if you
are going to say something about the totality, your findings should also be
observable, more or less easily, in any part of it, that is, in any sample
(any more or less specialized corpus). So I guess that (almost) anything
that is true for a "general corpus" should be true for more specialized
corpora as well, if you consider it a feature of the language. (But again,
obviously, not the other way round.) The point of this fuzzy writing is
that one can get a picture of language only through cumulative evidence
gathered from different sources, i.e. from (more or less specialized)
corpora. And even then the picture will be skewed and scanty, since we do
not know exactly what we are looking for, nor when the picture is complete
(if ever). Well, maybe I should have taken up philosophy instead of corpus
linguistics...

sincerely,
sampo

( : ============================================= : )

Sampo Nevalainen, M.A.
Researcher
University of Joensuu
Savonlinna School of Translation Studies
P.O.Box 48
FIN-57101 Savonlinna
FINLAND

tel     +358-15-511 70      (operator)
         +358-15-511 7704
fax     +358-15-515 096
email   samponev at cc.joensuu.fi
http://www.joensuu.fi/slnkvl/