Corpora: when does a subcorpus become a corpus?

jmck at mail.estv.ipv.pt
Mon Dec 24 01:57:15 UTC 2001


From sunny but chilly Portugal
Colleagues,
I am compiling a corpus of learner English argumentative essays and am on the lookout for a reference or "control" corpus of native speaker writing.
When I extract a sub-corpus from a corpus, for example by narrowing the 100-million-word BNC down to approximately 4 million words of written academic prose using David Lee’s trusty BNC Indexer, or by extracting all the social science extracts from the Brown Corpus, what exactly is the nature of such an extract or sub-corpus? Corpora carry their pedigree proudly, and the BNC, Brown and other corpora spell out the sampling methods used in their compilation. Thus the BNC makes a strong claim to be representative of British English in the 90s. Michael Rundell, at a mid-90s symposium in Portugal, suggested that we could get another good 15 years’ mileage out of the BNC. We can see from the continuing usefulness of Brown and LOB that perhaps he was being over-cautious.
What sort of representativeness do the 4 million or so words of academic prose have, once they have been detached from the larger British National Corpus? If the BNC is a sample of British turn-of-the-century language, what is the status of the set of all texts written by and for academics on academic subjects which happen to be contained in it? Do those features which the BNC compilation team considered distinctive in the spoken and written texts or extracts they chose for inclusion in the corpus also apply when compiling a selection of written texts to represent EAP? Does this transplanted body of texts become less representative once it is withdrawn from the co-text of the BNC, and does it then become an opportunistic corpus or a “quick and dirty” collection of texts?
Or rather, can we say that because the academic texts were chosen to represent British 90s academic writing within the broader (more heterogeneous) slice of linguistic life which is the BNC, then the more homogeneous subcorpus will be at least as representative of its register as the BNC is of the whole (British) language? I surmise that the texts chosen to represent academe in a general corpus might not correspond with the candidate texts for inclusion in an EAP corpus, and even less so when one specific genre is being studied.
Merry Christmas to everyone and all the best for the new millennium.
John McKenny
Departamento de Gestão
Escola Superior de Tecnologia
Campus Politecnico
3500 Viseu
Portugal
jmck at dgest.estv.ipv.pt


