Atkinson on phoneme inventories in Science

Bill Croft wcroft at UNM.EDU
Thu Apr 21 16:41:42 UTC 2011


I will not discuss the content of Matthew's response here, as I have 
discussed it with him privately. I agree with his main point, that 
Atkinson should control for large-scale areal influence. I also agree 
with Ian that once one opens the door to one geographical correlation 
hypothesis, one needs to consider others as well.

However, I strongly disagree with the implications of Matthew's 
polemical remarks:

"But the big problem with the Atkinson paper and others like it is 
precisely that nonlinguists who are experts on statistics do not 
understand the peculiar nature of crosslinguistic data...Linguists 
should be very wary of seeking the advice of nonlinguists regarding 
statistics."

I don't know if this is what Matthew intended, but this sounds very 
much as if (cross)linguistic data is somehow immune to the laws of 
statistics, and hence linguists need not concern themselves with 
developments in statistics (especially since such developments are 
the work of nonlinguists).

Linguistic data is not "peculiar". Linguistic data, like other data 
from human behavior and other complex systems, is the product of 
stochastic processes influenced by a variety of factors that causally 
interact. Our task is to identify the relevant factors and determine 
their influence, if any. That can be done by a range of statistical 
methods and models, which can, for example, deal with large-scale 
areal influence in the phoneme inventory data if we think it needs 
to be controlled for.

Of course, identifying the relevant factors depends on the causal 
models that we propose to account for the behavior. Atkinson has a 
causal model, which leads him to bring in the factors that he does 
(and he controls for quite a number of plausible confounding factors, 
though not area, if you read his supplementary materials). The 
problem is, we linguists do not believe in the causal model, so we 
don't think distance from Africa, or even population size, should be 
the only additional factors considered in the statistical analysis. 
But linguists don't all agree on causal models of language behavior 
either. (Note that Dunn et al. are a team of linguists as well as 
nonlinguists.) And sometimes we have to look outside the box and 
consider other possibilities, as Hay and Bauer (2007) did - even if 
they turn out to be artifacts.

I think that linguists should learn more about statistics. Many of 
the posts about Atkinson's paper at the Language Log, the NY Times 
article, and on Funknet do not recognize some basic statistical 
principles. Even the detailed and carefully reasoned posts would 
benefit from more detailed knowledge of statistics, I believe. I say 
that for myself as well, of course. For instance, I have been told 
(via a psycholinguist) that the puzzle I discussed, the possibility 
of different correlations in a sample and in partitions of the 
sample, has a name in statistics, Simpson's Paradox. I checked the 
indexes of the two statistics textbooks I have by linguists (Woods et 
al. 1986 and Baayen 2008), and my wife's university statistics 
textbook (Hays 1988); none of them listed Simpson's Paradox. I'm 
afraid that for me or any linguist to learn more about statistics 
means reading books written by nonlinguists, taking courses from 
nonlinguists, and/or consulting with nonlinguists.
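
To make the Simpson's Paradox point concrete, here is a minimal sketch in Python. The numbers are invented purely for illustration and are not taken from Atkinson's data or any other real dataset; the names area_a, area_b and pearson_r are hypothetical. Within each of two partitions the two variables are perfectly negatively correlated, yet pooling the partitions yields a positive correlation.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Invented toy data: (x values, y values) for two hypothetical partitions.
# Within each partition, y falls as x rises.
area_a = ([1, 2, 3, 4], [4, 3, 2, 1])
area_b = ([6, 7, 8, 9], [9, 8, 7, 6])

r_a = pearson_r(*area_a)
r_b = pearson_r(*area_b)

# Pool the two partitions into a single sample.
pooled_x = area_a[0] + area_b[0]
pooled_y = area_a[1] + area_b[1]
r_pooled = pearson_r(pooled_x, pooled_y)

print(f"partition A: r = {r_a:+.2f}")
print(f"partition B: r = {r_b:+.2f}")
print(f"pooled:      r = {r_pooled:+.2f}")
```

The reversal arises because partition B sits higher on both variables than partition A, so the difference between partitions dominates the pooled trend. Whether anything like this holds for the phoneme inventory data is an empirical question that the sketch does not address.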

The response by linguists to Dunn et al. and Atkinson has been 
uniformly negative. Many of the responses have also been arrogant, 
condescending and dismissive. The attitude appears to be that any 
work on language by nonlinguists, especially work using fancy 
statistics, is completely 
wrong. That is why I have felt obliged to defend those aspects of 
both papers that I think are positive, and to question some of the 
criticisms. This doesn't mean that I endorse their results: I don't, 
in the case of Dunn et al., and I am uncertain about Atkinson. But I 
think that the problems with Dunn et al. and with Atkinson are quite 
different - linguistically and statistically - and that it is worth 
linguists recognizing and understanding these differences.

Bill

Baayen, R. Harald. 2008. Analyzing linguistic data: a practical 
introduction to statistics using R. Cambridge: Cambridge University 
Press.

Hays, William L. 1988. Statistics (4th ed.). New York: Holt, Rinehart 
and Winston.

Woods, Anthony, Paul Fletcher & Arthur Hughes. 1986. Statistics in 
language studies. Cambridge: Cambridge University Press.