Corpora: minimum size of corpus?

michael klotz mklotz at phil.uni-erlangen.de
Thu Feb 10 08:47:36 UTC 2000


Hi Elaine,

It seems to me that there is a crucial difference between studying a
"dead language" like Latin or, I suppose, Biblical Aramaic and a "living
language" like, say, modern English. The problem with living languages
is, of course, that any corpus will be tiny compared to the overall
linguistic output of the speakers of the language (for example in a
single year). If we want to use corpus evidence to say something about
the language as a whole, we are crucially concerned with the question of
how confident we can be that our corpus data actually mirror the facts
of language. This is a question for inferential statistics, and the size
of our sample (i.e. corpus) plays an important role in it. (Another
important question would be how we proceeded in the sampling to achieve
representativeness, in terms of random sampling, stratified sampling
etc. Cf. the work done by Clear and Biber on this question.)
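
To make the role of sample size concrete, here is a minimal sketch in
Python. It is only an illustration under strong assumptions: it treats
every token as an independent random draw (which real text is not), and
the word frequency and corpus sizes are made-up numbers. The function
name wilson_interval is my own; it computes the standard Wilson score
interval for a proportion.

```python
import math

def wilson_interval(hits, n, z=1.96):
    """Wilson score interval for the proportion hits/n (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = hits / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half

# Hypothetical word that occurs 50 times per million tokens,
# observed in corpora of increasing (made-up) sizes.
for size in (100_000, 1_000_000, 100_000_000):
    hits = round(50e-6 * size)
    low, high = wilson_interval(hits, size)
    print(f"{size:>11,} tokens: {low * 1e6:.1f} to {high * 1e6:.1f} per million")
```

The interval narrows as the corpus grows, which is the sense in which
corpus size matters for inference about the language as a whole.
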
With dead languages there are two possible approaches: in one approach
we would consider whatever evidence we have for the language as a sample
of the way the language was spoken at the time. Of course, this would
again be a tiny sample of the overall linguistic output of the speakers
at the time, and the problems described above would apply.
However, in another sense whatever sources are left of a dead language
can be operationally considered to BE the language, since nobody will
ever produce new output in that language; i.e. there is a finite body of
parole. In this case your sample (i.e. corpus) would be identical to the
population it stands for (i.e. the "whole" language as we see it today),
and we would not be concerned with inferential statistics, but simply
with descriptive (summary) statistics. In that case the size of the
corpus would be of no concern, I think.
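
One way to see why the sample-equals-population case needs no inference
is the finite population correction from survey sampling. The sketch
below is an illustration only: it assumes simple random sampling without
replacement, the function name standard_error is my own, and the numbers
are invented.

```python
import math

def standard_error(p, n, population_size):
    """Standard error of a sample proportion p from n items drawn without
    replacement from a finite population (finite population correction)."""
    if not 0 < n <= population_size:
        raise ValueError("n must be between 1 and population_size")
    if population_size == 1:
        return 0.0
    fpc = (population_size - n) / (population_size - 1)
    return math.sqrt(p * (1 - p) / n * fpc)

# Hypothetical closed body of 5,000 tokens in which a form has frequency 1%.
print(standard_error(0.01, 500, 5000))   # partial sample: some uncertainty
print(standard_error(0.01, 5000, 5000))  # whole population: no sampling error
```

When the corpus is the whole population, the correction factor is zero,
so the counts are simply facts about that body of text.
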
Which of the two approaches you take really depends on your research
question. If you want to say something about Biblical Aramaic as found
in the extant sources, the second approach seems appropriate. If you
want to compare Biblical Aramaic to its modern descendants to say
something about how the language has changed, the first approach seems
more appropriate.

Yours,
Michael

--
Dr. Michael Klotz
Institut f. Anglistik und Amerikanistik
Universität Erlangen-Nürnberg
Bismarckstraße 1
91054 Erlangen
Tel.: 9131-8522938
email: mklotz at phil.uni-erlangen.de


