TTR variation

Thu Oct 14 12:51:33 UTC 1999

Dear Watkins,
The question is dealt with in "The statistical structure of a text and its
readability", by Juhan Tuldava, published in ALTMANN, G. & HREBICEK, L.,
Quantitative Text Analysis. Trier: Wissenschaftlicher Verlag Trier
(Quantitative Linguistics series, n. 52), 1993.
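
As a rough illustration of the size effect raised in the question (this is
not taken from Tuldava's chapter), here is a minimal Python sketch that
computes TTR on growing prefixes of one sample. The function names and the
toy sample are invented for illustration, and tokens are assumed to be
whitespace-separated word forms with no lemmatisation.

```python
def type_token_ratio(tokens):
    """Return distinct tokens (types) divided by total tokens."""
    if not tokens:
        raise ValueError("TTR is undefined for an empty sample")
    return len(set(tokens)) / len(tokens)


def ttr_by_sample_size(tokens, sizes):
    """Compute TTR on the first n tokens for each n in sizes.

    Sizes that are not positive or that exceed the sample length are skipped.
    """
    return {n: type_token_ratio(tokens[:n]) for n in sizes if 0 < n <= len(tokens)}


# Toy sample, invented for illustration: 24 tokens of repetitive text.
sample = (
    "the dog is big the dog is here the cat is here "
    "i want the dog i want the cat the cat is big"
).split()

for n, ratio in ttr_by_sample_size(sample, [4, 8, 12, 24]).items():
    print(f"{n:>3} tokens: TTR = {ratio:.2f}")
```

On this sample the ratio falls at each larger cut-off. In general TTR need
not fall at every single step, since each newly encountered word type
raises it slightly, but the downward trend with sample size is the reason
TTR values from corpora of different sizes are not directly comparable.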

----------
> From: Charles Watkins <charles.watkins at wanadoo.fr>
> To: info-childes at childes.psy.cmu.edu
> Subject: TTR variation
> Date: Tuesday, 12 October 1999 17:37
> 
> Dear Info-childers,
> 
> This is a bit long, so I'll put the question briefly and then in detail
> for those interested.
> 
> In brief:
> 
> Does anyone know of any research into statistical variation in TTR
> between corpora of different sizes in otherwise comparable (or identical)
> subjects?
> 
> In detail:
> 
> Research context:
> I am putting together anglophone monolingual control groups from the
> CHILDES database with which to compare my own bilingual (French and
> English) data. My own data concern 6 subjects at various ages (several
> ages for each subject). I therefore have about a dozen or so subcorpora,
> each corresponding to a particular bilingual subject at a particular age,
> for which I need the same number of control groups. I have been trawling
> through the CHILDES database, intending to find children with a similar
> profile, so as to isolate as far as possible the bilingualism of my own
> subjects as their distinguishing feature.
> The first selection was by age, and then I decided to narrow the focus
> further using purely linguistic measures (MLU for grammatical development
> and TTR for lexical development). The problem I am encountering is that my
> corpora are not of comparable size to the corpora I have so far selected
> from the database. While difference of size is not in itself a problem as
> far as comparing by MLU is concerned (there are other major problems, like
> various transcribers' different definitions and intuitions of what an
> utterance is), such difference in size is a well-known problem with TTR:
> in view of the repetition of high-frequency words, TTR mathematically
> decreases as the size of the corpus grows.
> 
> Question:
> I would therefore be interested to know of any research into the
> mathematical predictability of such variation, which would make it
> possible to extrapolate, for example, that a subject with a TTR of 0.3 in
> a corpus of 300 tokens could be expected to have a TTR of 0.2 in a corpus
> of 500 tokens. I would imagine that such a statistical tendency, if
> predictable with any accuracy, could be elegantly expressed by some graph
> in which the horizontal axis would be TTR (0 to 1) and the vertical axis
> the number of tokens (zero to ??): the curve would start bottom right at
> TTR 1 for a corpus of 1 token and rise towards top left, becoming
> increasingly steep as the limits of the subject's putative vocabulary are
> reached.
> 
> If such a tool exists, it is no doubt of wide application and I am sure
> others will have used it and can point me in the right direction.
> 
> I will post a summary of any answers.
> 
> Thanks.
> 
> Charles Watkins.
> 