Corpora: when does a subcorpus become a corpus

Robert Luk (COMP staff) csrluk at comp.polyu.edu.hk
Sat Jan 5 02:41:44 UTC 2002


This is an interesting discussion about the 'representativeness'
of a corpus and its subcorpora. I'll add my 2 cents here. Surely,
statisticians have been concerned about getting representative
samples for some time, and mechanisms are available, though not
perfect, to address this issue. The one that I can
think of is sequential (and stratified?) sampling.

Suppose we have infinite resources! And suppose we have
a (random or otherwise) sequence of subcorpora s1, s2, ...,
sn and their associated distributions d1, d2, ..., dn that
we observe for some specific purpose. The distribution could
be over words, the number of different meanings of a word, etc.
Then, we do sequential sampling as follows:

Let the merged distribution Di be defined recursively
as follows:

D1 := d1
Di := D(i-1) + di    for i > 1

where + merges two distributions. The sequential
sampling can stop once

a chi-square test finds that Di and D(i-1) are not
statistically significantly different at the X% level.

There is a possibility that the sequential sampling
could never stop.
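
To make the procedure concrete, here is a minimal Python sketch of
one possible reading of it. It is an illustration under assumptions,
not a definitive implementation: the names chi2_sf, chi2_homogeneity
and sequential_sample and the parameter alpha (standing in for X%) are
made up for this example, each distribution is assumed to be a
collections.Counter of raw counts, and the comparison is a Pearson
chi-square test on the 2 x k table formed by the previous and the
current merged distribution. Since the current merged distribution
contains the previous one, the two are not independent samples, so
the p-value is best treated as a heuristic stopping signal.

```python
import math
from collections import Counter


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with df degrees of freedom.

    Uses the power series for the regularized lower incomplete gamma function.
    """
    if x <= 0:
        return 1.0
    a, half_x = df / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    n = 1
    while term > total * 1e-15:
        term *= half_x / (a + n)
        total += term
        n += 1
    log_p = -half_x + a * math.log(half_x) - math.lgamma(a) + math.log(total)
    return max(0.0, 1.0 - math.exp(log_p))


def chi2_homogeneity(a, b):
    """Pearson chi-square test on the 2 x k table built from two non-empty Counters.

    Returns (statistic, p_value).
    """
    keys = [k for k in set(a) | set(b) if a[k] + b[k] > 0]
    na, nb = sum(a.values()), sum(b.values())
    stat = 0.0
    for k in keys:
        col = a[k] + b[k]
        for obs, row in ((a[k], na), (b[k], nb)):
            expected = row * col / (na + nb)
            stat += (obs - expected) ** 2 / expected
    df = len(keys) - 1
    return stat, (chi2_sf(stat, df) if df > 0 else 1.0)


def sequential_sample(subcorpora, alpha=0.05):
    """Merge d1, d2, ... until two successive merged distributions
    no longer differ significantly at level alpha.

    Returns (i, Di); i is None if the stream ends before the criterion is met.
    """
    merged = Counter()
    for i, d in enumerate(subcorpora, start=1):
        previous = merged.copy()
        merged.update(d)  # Di := D(i-1) + di
        if i > 1:
            _, p = chi2_homogeneity(previous, merged)
            if p > alpha:
                return i, merged
    return None, merged
```

For instance, sequential_sample([Counter(a=90, b=10),
Counter(a=10, b=90), Counter(a=50, b=50)]) does not stop at the second
subcorpus, where the merged counts shift from 90/10 to 100/100, but
stops at the third, where they stay evenly split. On a finite stream
that never meets the criterion it returns None as the index.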

Obviously, more sophisticated techniques could be
applied and more complicated modeling may be needed
(e.g. taking the time of sampling into account, since
the language may change over time).

Best,

Robert Luk


