Corpora: when does a subcorpus become a corpus

Sampo Nevalainen samponev at cc.joensuu.fi
Thu Jan 3 09:36:04 UTC 2002


Here is a short citation from Jennifer Pearson's "Terms in Context"
(Amsterdam 1998), p. 45:

--
Sinclair, who states that corpora can be divided into subcorpora, and that
corpora and subcorpora can be divided into components, defines a subcorpus
as having "all the properties of a corpus but happens to be part of a
larger corpus" (1994a:4). Thus, a subcorpus must have all the properties of
a larger corpus. We understand this to mean that it is representative of
the larger corpus. A component, on the other hand, according to Sinclair,
illustrates a particular type of language and is selected "according to a
set of linguistic criteria that serve to characterize its linguistic
homogeneity" (Sinclair 1994a:4). It differs from a subcorpus in that it is
not intended to be representative of the corpus from which it is drawn and
is therefore not necessarily an adequate sample of a language.
--

I did not go back to Sinclair ("Corpus Typology: A Framework for
Classification", EAGLES 1994), but according to Pearson, "a subcorpus must
have all the properties of a larger corpus", thus being representative of
the larger corpus. How this can be achieved is another question, although it
is obviously safer to state that a subcorpus is representative of the
larger corpus than to argue that the larger corpus (and, consequently, the
subcorpus) is representative of a language (or genre, etc.). Anyway, using
the terms defined above (without intending to agree fully with Pearson),
the set of EAP texts detached from the BNC would probably be called a
"component" rather than a "subcorpus". Personally, I would prefer to call
ANY corpus detached from another corpus a "subcorpus", regardless of its
content or composition. Whatever a set of texts is called, the question of
representativeness remains. Here I agree with Ute Roemer, who wrote: "The
important question in this context is 'What do you want to do with the
(sub)corpus?'"

Sincerely,
Sampo

P.S. Please regard this as a note from a person who tends to consider the
notion of "representative of a language" an oxymoron, a "mission
impossible".



( : ============================================= : )

Sampo Nevalainen, M.A.
Researcher
University of Joensuu
Savonlinna School of Translation Studies
P.O.Box 48
FIN-57101 Savonlinna
FINLAND

tel     +358-15-511 70      (operator)
         +358-15-511 7704
fax     +358-15-515 096
email   samponev at cc.joensuu.fi
http://www.joensuu.fi/slnkvl/


