TTR variation

Tue Oct 12 20:37:27 UTC 1999

Dear Info-childers,

This is a bit long, so I'll put the question briefly and then in detail for
those interested.

In brief:

Does anyone know of any research into statistical variation in TTR between
corpora of different sizes in otherwise comparable (or identical) subjects?

In detail:

Research context:
I am putting together anglophone monolingual control groups from the CHILDES
database with which to compare my own bilingual (French and English) data.
My own data concern 6 subjects at various ages (several ages for each
subject). I therefore have a dozen or so subcorpora, each corresponding to a
particular bilingual subject at a particular age, for which I need the same
number of control groups. I have been trawling through the CHILDES database,
intending to find children with a similar profile, so as to isolate as far
as possible the bilingualism of my own subjects as their distinguishing
feature. The first selection was by age, and then I decided to narrow the
focus further using purely linguistic measures (MLU, mean length of
utterance, for grammatical development and TTR, type-token ratio, for
lexical development). The problem I am encountering is that my corpora are
not comparable in size to the corpora I have so far selected from the
database. While difference of size is not in itself a problem as far as
comparing by MLU is concerned (there are other major problems, like various
transcribers' different definitions and intuitions of what an utterance is),
such difference in size is a well-known problem with TTR: because
high-frequency words are repeated, TTR tends to decrease as the size of the
corpus grows.
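
To make the size effect concrete, here is a minimal Python sketch. It assumes
a transcript has already been reduced to a flat list of word tokens; the
function name type_token_ratio and the toy sentence are invented for
illustration and are not CHILDES data or any existing tool.

```python
def type_token_ratio(tokens):
    """Return the number of distinct word forms (types) divided by
    the number of running words (tokens)."""
    if not tokens:
        raise ValueError("TTR is undefined for an empty sample")
    return len(set(tokens)) / len(tokens)


# Invented toy sample, not real child-language data.
sample = "the dog saw the cat and the cat saw the dog".split()

# First 4 tokens: "the dog saw the" has 3 types in 4 tokens.
short_ttr = type_token_ratio(sample[:4])

# All 11 tokens contain only 5 types, so the ratio is lower.
full_ttr = type_token_ratio(sample)

print(short_ttr, full_ttr)
```

The same speaker and the same text give a lower ratio for the longer sample,
purely because frequent words keep recurring.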

Question:
I would therefore be interested to know of any research into the
mathematical predictability of such variation, which would make it possible
to extrapolate, for example, that a subject with a TTR of 0.3 in a corpus of
300 tokens could be expected to have a TTR of 0.2 in a corpus of 500 tokens.
I would imagine that such a statistical tendency, if predictable with any
accuracy, could be elegantly expressed by a graph in which the horizontal
axis would be TTR (0 to 1) and the vertical axis the number of tokens (zero
to ??): the curve would start bottom right at TTR 1 for a corpus of 1 token
and rise towards top left, becoming increasingly steep as the limits of the
subject's putative vocabulary are reached.
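
The kind of curve described above can at least be traced empirically for a
single corpus. The following Python sketch does only that: it records the TTR
after each additional token, and it does not predict values beyond the
observed sample. The name ttr_curve and the toy token list are assumptions
made for illustration.

```python
def ttr_curve(tokens):
    """Return (n_tokens, ttr) pairs for the first 1, 2, ..., len(tokens)
    tokens of a sample."""
    seen = set()      # types encountered so far
    points = []
    for n, token in enumerate(tokens, start=1):
        seen.add(token)
        points.append((n, len(seen) / n))
    return points


# Invented toy sample: one frequent word "a" and a few rarer ones.
toy = "a b a c a b a d".split()
curve = ttr_curve(toy)

for n, ttr in curve:
    print(n, round(ttr, 3))
```

Each pair can be plotted with TTR on one axis and the number of tokens on the
other. Note that the observed ratio is not strictly monotone: it moves back
up a little each time a new type appears, so the downward trend only emerges
over longer stretches.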

If such a tool exists, it is no doubt of wide application and I am sure
others will have used it and can point me in the right direction.

I will post a summary of any answers.

Thanks.

Charles Watkins.