Corpora: when does a subcorpus become a corpus?

Ute Römer ute.roemer at uni-koeln.de
Fri Dec 28 16:22:24 UTC 2001


Dear John,

I think the questions you asked about the representativeness of BNC
subcorpora are very important!
You asked:
> What sort of representativeness do the 4 million or so words of academic
> prose have, once they have been detached from the larger British National
> Corpus?
AND
> Does this transplanted body of texts become less representative once it is
> withdrawn from the co-text of the BNC and does it then become an
> opportunistic corpus or a "quick and dirty" collection of texts?
My answer to the last question would be "Yes, it does" (although I wouldn't
call the subcorpus a quick and dirty collection of texts). It seems to me
that a selection of 4 million words of English academic prose is too small
to be representative of the whole range of EAP. However, the important
question in this context is "What do you want to do with the (sub)corpus?" A
4 million word (sub)corpus is probably large enough to carry out research on
frequent lexico-grammatical phenomena (e.g. if-clauses or modals) but it
might be too small for studies of less frequently used
words/lexemes/structures (and especially of multi-word items).
When they compiled the BNC, the compilers (as you already mentioned) set out
to create a representative collection of contemporary British English
(and as far as I can tell the BNC IS such a representative sample). What we
cannot be 100% sure about, however, is whether the compilers were also
aiming at the creation of a representative collection of EAP (and other
genres/subgenres) within their corpus. Maybe one of the BNC experts
subscribed to the list can help?

All the best from Cologne,
Ute

ute.roemer at uni-koeln.de


