TTR variation

Thu Oct 14 08:22:37 UTC 1999

Dear Charles,

My colleague David Malvern and I have spent a lot of time on this problem.
We have developed a solution which mathematically models the relationship
between TTR and token size, and uses software to randomly sample tokens
from CHAT transcripts and give average TTRs for increasing token sizes. A
curve-fitting procedure adjusts one parameter (D) to provide the best fit
between actual and theoretical curves. D is used as the measure of lexical
diversity, and validation work on this measure has been encouraging.
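
For readers who want to see the shape of the procedure, here is a minimal
sketch in Python. It is not the UNIX software itself. The curve equation
(TTR = (D/N) * (sqrt(1 + 2N/D) - 1)), the sample sizes of 35 to 50 tokens,
the 100 trials per size, and the grid search used for the fit are all
assumptions made for illustration, and the names model_ttr, mean_ttr and
fit_d are hypothetical. It takes a plain list of tokens rather than a CHAT
transcript.

```python
import math
import random


def model_ttr(n, d):
    """Theoretical TTR for n tokens under the assumed one-parameter curve."""
    return (d / n) * (math.sqrt(1.0 + 2.0 * n / d) - 1.0)


def mean_ttr(tokens, n, trials, rng):
    """Average TTR over `trials` random samples of n tokens, drawn without replacement."""
    total = 0.0
    for _ in range(trials):
        sample = rng.sample(tokens, n)
        total += len(set(sample)) / n
    return total / trials


def fit_d(tokens, sizes=range(35, 51), trials=100, seed=0):
    """Return the D (to one decimal place) whose curve best fits the sampled TTRs.

    The sample sizes, trial count and grid of candidate D values are
    illustrative assumptions, not settings taken from the original software.
    """
    rng = random.Random(seed)
    observed = [(n, mean_ttr(tokens, n, trials, rng)) for n in sizes]

    def squared_error(d):
        return sum((ttr - model_ttr(n, d)) ** 2 for n, ttr in observed)

    # Coarse least-squares grid search over D = 0.1, 0.2, ..., 200.0
    candidates = [k / 10 for k in range(1, 2001)]
    return min(candidates, key=squared_error)
```

Calling fit_d on a list of at least 50 word tokens returns a single number,
and a transcript with a more varied vocabulary should give a larger one.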

The software exists in a UNIX version. I recently sent a copy to Brian
MacWhinney in the hope that it can be made more widely available via
CHILDES.

Brian

==============================================
Dr. Brian J. Richards
School of Education
University of Reading
Bulmershe Court
Earley
Reading RG6 1HY
UK                   Tel 0118 9875123 (x 4814)
==============================================
On Tue, 12 Oct 1999, Charles Watkins wrote:

> Dear Info-childers,
>
> This is a bit long, so I'll put the question briefly and then in detail for
> those interested.
>
> In brief :
>
> Does anyone know of any research into statistical variation in TTR between
> corpora of different sizes in otherwise comparable (or identical) subjects?
>
> In detail :
>
> Research context:
> I am putting together anglophone monolingual control groups from the CHILDES
> database with which to compare my own bilingual (French and English) data.
> My own data concern 6 subjects at various ages (several ages for each
> subject). I therefore have about a dozen or so subcorpora, each
> corresponding to a particular bilingual subject at a particular age, for
> which I need the same number of control groups. I have been trawling through
> the CHILDES database, intending to find children with a similar profile, so
> as to isolate as far as possible the bilingualism of my own subjects as
> their distinguishing feature. The first selection was by age, and then I
> decided to narrow the focus further using purely linguistic measures (MLU
> for grammatical development and TTR for lexical development). The problem I
> am encountering is that my corpora are not of comparable size to the corpora
> I have so far selected from the database. While difference of size is not in
> itself a problem as far as comparing by MLU is concerned (there are other
> major problems, like various transcribers' different definitions and
> intuitions of what an utterance is), such difference in size is a well-known
> problem with TTR: in view of the repetition of high-frequency words, TTR
> mathematically decreases as the size of the corpus grows.
>
> Question:
> I would therefore be interested to know of any research into the
> mathematical predictability of such variation, which would make it possible
> to extrapolate, for example, that a subject with a TTR of 0.3 in a corpus of
> 300 tokens could be expected to have a TTR of 0.2 in a corpus of 500 tokens.
> I would imagine that such a statistical tendency, if predictable with any
> accuracy, could be elegantly expressed by some graph in which the horizontal
> axis would be TTR (0 to 1) and the vertical axis the number of tokens (zero
> to ??): the curve would start bottom right at TTR 1 for a corpus of 1 token
> and rise towards top left, becoming increasingly steep as the limits of the
> subject's putative vocabulary are reached.
>
> If such a tool exists, it is no doubt of wide application and I am sure
> others will have used it and can point me in the right direction.
>
> I will post a summary of any answers.
>
> Thanks.
>
> Charles Watkins.
>
>
>
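
As a rough worked example of the extrapolation asked about above, the sketch
below takes a TTR of 0.3 at 300 tokens, finds the D that a one-parameter
curve would need to pass through that point, and reads off the TTR the same
curve gives at 500 tokens. The curve equation
(TTR = (D/N) * (sqrt(1 + 2N/D) - 1)) is an assumption that is not stated in
this exchange, and the names model_ttr and d_from_ttr are hypothetical. A
single observed point is a much weaker basis than a fit over many sampled
sizes, so the result is only indicative.

```python
import math


def model_ttr(n, d):
    """Theoretical TTR for n tokens under the assumed one-parameter curve."""
    return (d / n) * (math.sqrt(1.0 + 2.0 * n / d) - 1.0)


def d_from_ttr(n, ttr):
    """Solve the assumed curve for D given one (n, ttr) point, with 0 < ttr < 1.

    Rearranging TTR = (D/N) * (sqrt(1 + 2N/D) - 1) gives
    D = N * TTR**2 / (2 * (1 - TTR)).
    """
    return n * ttr ** 2 / (2.0 * (1.0 - ttr))


d = d_from_ttr(300, 0.3)      # D implied by TTR 0.3 at 300 tokens
predicted = model_ttr(500, d)  # TTR the same curve gives at 500 tokens
print(f"D = {d:.1f}, predicted TTR at 500 tokens = {predicted:.3f}")
```

Under these assumptions the predicted TTR at 500 tokens comes out at roughly
0.24, so the drop is smaller than the 0.2 suggested in the question.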