16.47, Review: Lang Acq/Psycholing: Malvern et al (2004)

Wed Jan 12 07:44:38 UTC 2005

LINGUIST List: Vol-16-47. Wed Jan 12 2005. ISSN: 1068 - 4875.

Moderators: Anthony Aristar, Wayne State U <aristar at linguistlist.org>
            Helen Aristar-Dry, Eastern Michigan U <hdry at linguistlist.org>

Reviews (reviews at linguistlist.org) 
        Sheila Collberg, U of Arizona  
        Terry Langendoen, U of Arizona  

Homepage: http://linguistlist.org/

The LINGUIST List is funded by Eastern Michigan University, Wayne
State University, and donations from subscribers and publishers.

Editor for this issue: Naomi Ogasawara <naomi at linguistlist.org>
================================================================  

What follows is a review or discussion note contributed to our 
Book Discussion Forum. We expect discussions to be informal and 
interactive; and the author of the book discussed is cordially 
invited to join in. If you are interested in leading a book 
discussion, look for books announced on LINGUIST as "available 
for review." Then contact Sheila Collberg at collberg at linguistlist.org. 

===========================Directory==============================  

1)
Date: 11-Jan-2005
From: Philip McCarthy < pmccarthy at mail.psyc.memphis.edu >
Subject: Lexical Diversity and Language Development 

-------------------------Message 1 ---------------------------------- 
Date: Wed, 12 Jan 2005 02:42:51
From: Philip McCarthy < pmccarthy at mail.psyc.memphis.edu >
Subject: Lexical Diversity and Language Development 

AUTHORS: Malvern, David D.; Chipere, Ngoni; Richards, Brian J.; Durán, Pilar
TITLE: Lexical Diversity and Language Development 
SUBTITLE: Quantification and Assessment
PUBLISHER: Palgrave Macmillan 
YEAR: 2004 
Announced at http://linguistlist.org/issues/15/15-2189.html

Philip M. McCarthy, Department of English (Linguistics), and the Institute 
for Intelligent Systems (IIS), the University of Memphis.

"Lexical Diversity and Language Development: Quantification and 
Assessment" is, predominantly, a summary of David Malvern and Brian 
Richards' last seven years' work on the lexical richness measure known 
as 'D'. The measure D, it is argued, is the most reliable measure of 
lexical diversity and is particularly useful for measuring short 
transcripts such as those produced by young children. The book is of 
interest to researchers working in the areas of language acquisition, 
English as a second language (ESL), aphasiology, or any other field where 
the quantification of language deployment (lexical diversity) is a factor.

Lexical diversity, reported throughout this book (despite the title) as 
lexical richness, is one of the greatest linguistic enigmas -- if a rather 
unsung one. In brief, we have long known that people of different ages and 
abilities, and different texts for different purposes, appear to produce 
significantly different degrees of lexical diversity. No one, for 
instance, would argue that Shakespeare was less diverse in his vocabulary 
deployment than would be a typical five-year-old child. And by the same 
token, we all seem to intuitively know that works by such authors as Joyce 
or Tolstoy are lexically richer than are works by, say, Hemingway or 
Steinbeck. Despite such appearances, however, no one has yet been able to 
produce a measure that is capable of scoring such differences meaningfully 
and accurately: It is as if we were all aware of differences in 
temperature, had tacitly agreed what constituted heat, and yet had been 
unable to invent the thermometer. What Malvern et al. are offering us, 
therefore, is the best yet attempt at a lexical diversity thermometer.

Malvern et al.'s book is organized into four parts. The first, and main 
part of the book, serves to explain the concept of lexical richness, to 
outline why lexical richness is such a tricky and elusive measurement, to 
explain where and how lexical richness measures have been employed, to 
discuss the various types of lexical richness measures that have been 
proposed, to show where and why these measures fail to reliably account 
for lexical richness, and, most importantly, to introduce and discuss the 
measure known as D. Part II offers a collection of previously published 
papers that serve to support the authors' claims as to D's reliability. 
Part III offers a look at other considerations for lexical richness 
measures, and part IV is a brief overview and conclusion.

The book's review of previously proposed measures of lexical richness is 
probably the most thorough ever published. The authors begin by explaining 
the underlying problem of basic lexical richness measures, such as 
type-token ratio (TTR). In brief, types are the distinct words used in a 
text, whereas tokens are the individual instances of those words. Thus, 
the sentence "the big dog chased the small dog" has five types and six 
tokens; the types "the" and "dog" have two tokens each. The problem, as 
Malvern et 
al. explain, is that as a text increases in length the likelihood of new 
types being introduced decreases. Consequently, the longer a text is, the 
lower the TTR is likely to be.

Over the years, numerous alternatives to TTR have been proposed, and 
Malvern et al. explain each with great clarity. Mathematically manipulated 
lexical richness scores such as RootTTR (G) and Corrected-TTR (C), 
logarithmic variations of lexical richness such as R and H, and frequency 
based measures such as Z and K are all explained, dissected, and 
discredited. Malvern et al. show the problems with each measure through 
theoretical and empirical approaches. The studies of Jarvis (2002) and 
Tweedie and Baayen (1998) form a good deal of the empirical testing that 
has shown problems with other lexical richness measures, and where theory 
rather than empirical evidence discredits the measures, Malvern et al. go 
to great lengths themselves to explain the problems.

Part I builds towards the most complex method of obtaining a lexical 
richness score: the "curve fitting" approach of Sichel (1986). It is 
largely on the basis of this model that Malvern et al. have composed their 
measure of D. Like Sichel's model, D operates by trying to fit empirical 
data, derived from TTR scores, to a theoretical TTR curve. D differs from 
Sichel in a number of ways: of primary importance is that D operates by 
taking hundreds of samples of data and averaging them to fit an ideal TTR 
curve. Because of the complexity of D, the freely available vocd software 
(MacWhinney, 2000) is used to make the calculation. The 18 pages dedicated 
to D's development are highly enlightening and are clearly the book's 
most important section. Although much of what is written in this section 
has been said in previously published journal articles (Malvern & 
Richards, 1997; McKee, Malvern & Richards, 2000), the thoroughness and 
clarity with which the development of D is relayed here make it well 
worth the read.
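
The sample-average-and-fit idea can be sketched roughly as follows. This 
is an illustration only and not the vocd implementation. It assumes the 
model curve TTR = (D/N)(sqrt(1 + 2N/D) - 1), the form usually quoted for D 
in published descriptions of the measure. The sample sizes of 35 to 50 
tokens follow the procedure as described in this review; the 100 samples 
per size, the coarse grid search, and all function names are choices made 
for the sketch.

```python
import math
import random

def model_ttr(n, d):
    """Assumed theoretical TTR for a sample of n tokens and diversity d."""
    return (d / n) * (math.sqrt(1 + 2 * n / d) - 1)

def mean_sampled_ttr(tokens, n, trials, rng):
    """Average TTR over `trials` random n-token samples (no replacement)."""
    total = 0.0
    for _ in range(trials):
        total += len(set(rng.sample(tokens, n))) / n
    return total / trials

def estimate_d(tokens, sizes=range(35, 51), trials=100, seed=0):
    """Fit d to the averaged empirical TTR curve by a coarse grid search."""
    tokens = list(tokens)
    if len(tokens) < max(sizes):
        raise ValueError("need at least as many tokens as the largest sample")
    rng = random.Random(seed)
    observed = [(n, mean_sampled_ttr(tokens, n, trials, rng)) for n in sizes]

    def squared_error(d):
        return sum((ttr - model_ttr(n, d)) ** 2 for n, ttr in observed)

    candidates = [step / 10 for step in range(1, 5001)]  # d = 0.1 ... 500.0
    return min(candidates, key=squared_error)

# Two artificial "transcripts": 5 types repeated, and 60 types repeated.
repetitive = ["a", "b", "c", "d", "e"] * 20
varied = [f"w{i % 60}" for i in range(120)]
print(estimate_d(repetitive), estimate_d(varied))
```

In this sketch the more varied token list receives the higher fitted 
value, since a higher D corresponds to a TTR curve that falls more slowly 
as the sample grows.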

If part I is the synthesis and expansion of D's genesis (Malvern & 
Richards, 1997; McKee et al., 2000; Duran, Malvern, Richards, & Chipere, 
2004) then Part II is simply the collection and reprinting of more recent 
papers (Malvern & Richards, 2002; Richards & Malvern, 2004). The four 
chapters forming Part II provide empirical evidence supporting D and its 
operating methodology: Chapters 4 and 5 focus on measures of D across 
different corpora, Chapter 6 offers compelling evidence on the 
inadequacies of assessment examination testing as opposed to the 
reliability of results produced by D, and Chapter 7 investigates how 
variations in lemmatizing the analysis of words can lead to markedly 
differing results. While these chapters would have been more convincing 
had there been more work from other researchers, Malvern et al.'s own 
breadth of experimentation and investigation is quite forceful. Hopefully, 
more research will soon be underway to support even further these initial 
findings.
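
The sensitivity to lemmatization examined in Chapter 7 is easy to picture 
with a toy example. The lemma table below is hypothetical and covers only 
the words in the sample sentence; a real analysis would use a proper 
lemmatizer, and nothing here reproduces the book's own procedure.

```python
# Hypothetical toy lemma table, for illustration only.
LEMMAS = {"dogs": "dog", "chased": "chase", "chases": "chase", "chasing": "chase"}

def count_types(tokens, lemmatize=False):
    """Count distinct forms, optionally mapping inflected forms to lemmas."""
    if lemmatize:
        tokens = [LEMMAS.get(token, token) for token in tokens]
    return len(set(tokens))

tokens = "the dog chased the dogs and the dogs are chasing the dog".split()
print(count_types(tokens), count_types(tokens, lemmatize=True))  # 7 5
```

The same twelve tokens yield seven types when inflected forms are counted 
separately and five when they are reduced to stems, so any diversity score 
built on type counts shifts with the counting policy.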

Part III of the book compares lexical richness to other methods for 
assessing texts: type-type ratios (as opposed to type-token ratios), for 
example, are considered. Evidence compiled here suggests that 
investigations into the diversity of parts of speech are also a product of 
text length and that, once again, D may provide the best answers. In Part 
III, the authors also expand the investigation of D's reliability into 
written texts concluding that the measure effectively discriminates across 
ages and developmental levels. Part IV is a bare six-page overview and 
conclusion. The brevity is somewhat disturbing as one would imagine the 
potential for future research involving lexical richness and D would be 
vast. And it would certainly seem apparent that far more testing of D 
would be undertaken. That said, Malvern et al. do take this opportunity to 
once more drive home the importance of an accurate measure of lexical 
richness, and they once more go to great pains to show how numerous 
previous studies using flawed measures of lexical richness have led to 
results that must now be seriously questioned (for example, see Le Normand 
& Cohen, 1999; Ouellet, Cohen, Le Normand, & Braun, 2000; and 
Delaney-Black et al., 2000). Even studies as recent as Ertmer, Strong, and 
Sadagopan (2003) use TTR of differing text lengths and quote the 
questionable "norms" of Templin (1957). Malvern et al. show their clear 
concern by writing:

These things matter. Much of the research based on flawed measures has 
significant implications for theory, practice, and policy. It is important 
therefore that the methodological issues of measuring vocabulary richness 
are understood and that these confusions are cleared up.

The authors' conclusion also acknowledges a few of D's problems: problems 
involving topic change and rhetorical styles that confound the curve 
fitting approach of D. Such problems are not dwelt upon however, and it 
would be fair to assume that later analyses of D will be somewhat more 
critical.

The authors' claim that previous lexical richness measures are 
unreliable, and the evidence for that claim, are both well made. It would 
be hard to believe that 
following such work any previously published approach could now win favor 
as the lexical richness measure of choice. Unfortunately, whether D itself 
is truly capable of carrying the crown is also, as we shall see, less than 
assured.

As the book is essentially an advertisement for D, rather than a 
disinterested history of lexical richness, criticism and potential 
problems with D are less than boldly stated. The main problem for D lies 
in its limitations caused by the attempt to satisfy its primary aim. As 
stated above, this aim is to offer a reliable measure of lexical richness 
for short samples of transcripts. The problem for Malvern et al. is that 
while other measures of lexical richness are particularly weak at 
measuring short samples, in establishing a measure that actually does 
accomplish the task, Malvern et al. appear to have made a measure that is 
only accurate for short samples. In other words, we must ask whether the 
baby has been thrown out with the bathwater. A closer look at how D is 
calculated may show why this is so.

Malvern et al. use the vocd system to sample items from the available 
data. These samples are between 35 and 50 tokens in length. As such, the 
minimum transcript size is 50 words; however, Malvern et al. state that 
they cannot guarantee a reliable score for samples this small. Thus, the 
lower end of reliability for the measure is not made clear -- except to 
say that it must be above 50 tokens. Similarly, Malvern et al. cannot 
claim that D is reliable for longer texts. In fact, they place their upper 
limit at an unspecified "few hundred" tokens. The first question to ask, 
therefore, is, if D is reliable then where exactly is it reliable? The 
transcript borders are not that far apart (greater than 50 tokens but less 
than a few hundred), yet if the border areas are so murky then researchers 
would seriously have to wonder whether their data were of a suitable 
length for D.

The next issue is that Malvern et al. recommend using only stem forms in 
any lexical analysis so as to reduce the potential for confounding 
results. They further recommend controls for testing participants so that 
conversational topics do not diversify greatly. Perhaps most worrying, 
however, is that they base the primary evidence of empirical testing on a 
corpus of 32 transcripts from children of just 2;8 years of age (Duran et 
al. 2004).

Such limited borders of transcript size, based on the production of such 
young children, from such a small corpus, with only stem forms recommended 
for fear of confounding D, does not yet secure faith that D is the most 
reliable (nor the most robust) of lexical richness measures.

We can look at Owen and Leonard's (2002) study for supporting concerns 
over D. In this work, it was concluded that D may not be a reliable 
measure of lexical richness. Owen and Leonard's transcripts were divided 
into sample sizes of 100, 250 and 500 tokens but when measured for lexical 
richness, differing D scores were produced. Jarvis (2002), despite knowing 
of D, chose to use an earlier D incarnation (see Malvern and Richards 
1997) and was quite critical of the theoretical underpinning of the latest 
version of D (the one used in this book). The earlier D, used by Jarvis 
(2002), was quite successful at predicting lexical richness measures; 
however, the texts used in his study all had fewer than 400 words, and an 
alternative measure, U, actually performed better. Silverman and Bernstein 
Ratner (2002), on the other hand, do provide support for D, and Owen and 
Leonard (2002), while finding fault with D, still mention that it is a 
promising tool. On the whole, however, while Malvern and his colleagues 
continue to turn out positive studies on D, the wider community has not 
yet reached the same level of enthusiasm.

With a relatively limited use for the measure D, it is extremely hard to 
see how the measure could become the standard for lexical richness. That 
said, whatever the weaknesses of D, it does appear to be more reliable 
than any other available measure for texts of shorter length. Researchers 
would certainly be strongly advised to, at least, include D in their 
measurements, whatever the text size. However, with data of differing text 
length, or from different sources, researchers are equally strongly 
advised to interpret results with great care. While D itself may yet have 
a number of problems to overcome, while Malvern et al. may well have been 
a shade generous in their assessment of D, and while this book appears to 
promise much discussion on lexical diversity but in the end serves more as 
a commercial for a single measure, the book itself is nonetheless clearly 
the best (and indeed the only) book on lexical diversity currently 
available. Its competitors, Yule (1944) and Herdan (1960), have long been 
out of date, and a more recent offering by Baayen (2001) neither comes 
close to the expansive history offered by Malvern et al., nor does it 
focus on diversity so much as it does distribution. The significance of 
the differences between the two approaches may best be described by 
stating that neither author sees fit to mention the other's work. In sum, 
Lexical Diversity and Language Development makes a good attempt to fill a 
gaping hole in linguistic enquiry; however, whether its proposed product 
lives up to its authors' faith will only be revealed if greater research 
in this area (and through this method) is undertaken.

REFERENCES

Baayen, R. H. (2001). Word frequency distributions. Kluwer Academic 
Publishers, Dordrecht.

Delaney-Black, V., Covington, C., Templin, T., Kershaw, T., 
Nordstrom-Klee, B., Ager, J., Clark, N., Surendon, A., Martier, S., and 
Sokol, R. J. (2000). Expressive language development of children exposed 
to cocaine prenatally: Literature review and report of a prospective 
cohort study. Journal of Communication Disorders, 33, 463-81.

Ertmer, D. J., Strong, L. M., and Sadagopan, N. (2003). Beginning to 
communicate after cochlear implantation: Oral language development in a 
young child. Journal of Speech, Language and Hearing Research, 46, 328-40.

Herdan, G. (1960). Type-Token mathematics: A textbook of mathematical 
linguistics. The Hague: Mouton.

Jarvis, S. (2002). Short texts, best fitting curves, and new measures of 
lexical diversity. Language Testing, 19, 1-15.

Le Normand, M. T., and Cohen, H. (1999). The delayed emergence of lexical 
morphology in preterm children: The case of verbs. Journal of 
Neurolinguistics, 12, 235-46.

MacWhinney, B. (2000). The CHILDES project: Tools for analyzing talk (3rd 
ed, Vol. 1: Transcription format and programs). Mahwah, NJ: Erlbaum.

McKee, G., Malvern, D. D., and Richards, B. J. (2000). Measuring 
vocabulary diversity using dedicated software. Literary and Linguistic 
Computing, 15, 323-38.

Malvern, D. D., and Richards, B. J. (1997). A new measure of lexical 
diversity. In A. Ryan and A. Wray (Eds), Evolving models of language: 
Papers from the Annual Meeting of the British Association of Applied 
Linguists held at the University of Wales, Swansea, September 1996 
(pp. 58-71). Clevedon: Multilingual Matters.

Malvern, D. D., and Richards, B. J. (2002). Investigating accommodation in 
language proficiency interviews using a new measure of lexical diversity. 
Language Testing, 19, 85-104.

Ouellet, C., Cohen, H., Le Normand, M. T., and Braun, C. (2000). 
Asynchronous language acquisition in developmental dysphasia. Brain and 
Cognition, 43, 352-7.

Owen, A. and Leonard, L. B. (2002). Lexical diversity in the spontaneous 
speech of children with specific language impairment: Application of VOCD. 
Journal of Speech, Language and Hearing Research, 45, 927-37.

Richards, B. J. and Malvern, D. D. (2004). Investigating the validity of a 
new measure of lexical diversity for root and inflected forms. In K. 
Trott, S. Dobbinson and P. Griffith, eds., The child language reader 
(pp. 81-9). London: Routledge.

Sichel, H. S. (1986). Word frequency distributions and type-token 
characteristics. Mathematical Scientist, 11, 45-72.

Silverman, S. and Bernstein Ratner, N. (2002). Measuring lexical diversity 
in children who stutter: application of vocd. Journal of Fluency 
Disorders, 27, 289-304.

Templin, M. (1957). Certain language skills in children. Minneapolis: 
University of Minnesota Press.

Tweedie, F. J., and Baayen, R. H. (1998). How variable may a constant be? 
Measures of lexical richness in perspective. Computers and the Humanities, 
32, 323-52.

Yule, G. U. (1944). The statistical study of literary vocabulary. 
Cambridge: Cambridge University Press. 

ABOUT THE REVIEWER

Philip McCarthy moved to the United States in 2001 having spent 11 years 
as an English teacher in England, Turkey and Japan. In 2003, he graduated 
with a Master's degree in English (Linguistics) from The University of 
Memphis, and he is currently conducting research for his Ph.D. in applied 
linguistics at the same university. Philip's primary work concerns lexical 
and textual diversity algorithms though he has also published work on 
child readers and the application of cohesion measures across genres. 
Philip is currently working as a research assistant on three grants at the 
Institute for Intelligent Systems at the FedEx Institute for Technology: 
iSTART, CohMetrix, and the iMAP project. His primary responsibilities are 
corpus analyses and programming. Philip teaches a variety of linguistics, 
ESL and composition courses. He is also working on a number of software 
projects including a phoneme acquisition application, and temporal and 
structural cohesion algorithms. When not working, Philip coaches one of 
Memphis's most successful soccer teams: Strangers FC.
