Corpora: when does a subcorpus become a corpus

P. Kaszubski przemka at amu.edu.pl
Sat Dec 29 00:00:56 UTC 2001


Hello,

Matters of representativeness are vital for corpus research, as
we all know only too well. I have been doing quite a bit of "corpora
trimming" for my own comparative purposes when trying to obtain
the most representative results for my phraseological studies with
learner corpora. Much depends, of course, on how clear-cut the
boundaries of the genre under analysis are. My own observation has
been that it is sometimes better to ease off the chase after
"comparability" between corpora, because chances are that we will
never be exactly satisfied with the degree of match, and
consequently with the statistics derived. It makes more sense to
me to try to obtain a few corpora (or subcorpora) claiming to
represent the same, or a similar enough, genre and to perform multiple
rather than bilateral (pairwise) statistical analyses.

Having said this, I do of course declare myself an advocate of
careful, principled and well-documented corpus compilation in the
first place. The more we know about the EAP part of the BNC, the
better we can design our tests. BTW, is the notion of SUBCORPUS
discussed theoretically and/or defined anywhere? Does it
necessarily carry with it the same value of representativeness (as
defined e.g. in Sinclair's 3C book, or the BNC handbook, in
contrast to opportunistic text collections)? How organised need we
be when extracting texts from a corpus to be able to call the result a
"subcorpus"? It just struck me as an interesting question.

Regards,

P. Kaszubski




=======================================
Dr Przemyslaw Kaszubski
t: +48 61 8293515
e: przemka at amu.edu.pl
w: http://elex.amu.edu.pl/ifa/staff/kaszubski.html

MY (ENGLISH) LEARNER CORPORA PAGE:
http://main.amu.edu.pl/~przemka

School of English
Adam Mickiewicz University
Al. Niepodleglosci 4
61-874 Poznan
t: +48 61 8293506
f: +48 61 8523103
w: http://elex.amu.edu.pl/ifa
=======================================


