Atkinson on phoneme inventories in Science

Thu Apr 21 16:53:38 UTC 2011

Here is the wiki on Simpson's paradox: http://en.wikipedia.org/wiki/Simpson%27s_paradox

Some of the best new work in linguistics is being done by non-linguists. The work of the folks at TedLab (http://tedlab.mit.edu/tedlab_website/Publications.html) in Brain and Cognitive Sciences at MIT is another example. Collaboration is key to avoiding some of the shortfalls, but the days when linguists like me, with the mathematical skills of a donkey, can publish without experimental and statistical support are fading quickly.

I think those of us in the mathematically Luddite generation should work harder to understand and to engage in collaborative efforts like those of Dunn, Atkinson, etc.

Dan

On Apr 21, 2011, at 12:41 PM, Bill Croft wrote:

I will not discuss the content of Matthew's response here, as I have discussed it with him privately. I agree with his main point, that Atkinson should control for large-scale areal influence. I also agree with Ian that once one opens the door to one geographical correlation hypothesis, one needs to consider others as well.

However, I strongly disagree with the implications of Matthew's polemical remarks:

"But the big problem with the Atkinson paper and others like it is precisely that nonlinguists who are experts on statistics do not understand the peculiar nature of crosslinguistic data...Linguists should be very wary of seeking the advice of nonlinguists regarding statistics."

I don't know if this is what Matthew intended, but this sounds very much as if (cross)linguistic data is somehow immune to the laws of statistics, and hence linguists need not concern themselves with developments in statistics (especially since such developments are the work of nonlinguists).

Linguistic data is not "peculiar". Linguistic data, like other data from human behavior and other complex systems, is the product of stochastic processes influenced by a variety of factors that causally interact. Our task is to identify the relevant factors and determine their influence, if any. That can be done by a range of statistical methods and models, which can, for example, deal with large-scale areal influence in the phoneme inventory data if we think they should.

Of course, identifying the relevant factors depends on the causal models that we propose to account for the behavior. Atkinson has a causal model, which leads him to bring in the factors that he does (and he controls for quite a number of plausible confounding factors, though not area, if you read his supplementary materials). The problem is, we linguists do not believe in the causal model, so we don't think distance from Africa, or even population size, should be the only additional factors considered in the statistical analysis. But linguists don't all agree on causal models of language behavior either. (Note that Dunn et al. are a team of linguists as well as nonlinguists.) And sometimes we have to look outside the box and consider other possibilities, as Hay and Bauer (2007) did - even if they turn out to be artifacts.

I think that linguists should learn more about statistics. Many of the posts about Atkinson's paper at the Language Log, the NY Times article, and on Funknet do not recognize some basic statistical principles. Even the detailed and carefully reasoned posts would benefit from more detailed knowledge of statistics, I believe. I say that for myself as well, of course. For instance, I have been told (via a psycholinguist) that the puzzle I discussed, the possibility of different correlations in a sample and in partitions of the sample, has a name in statistics, Simpson's Paradox. I checked the indexes of the two statistics textbooks I have by linguists (Woods et al. 1986 and Baayen 2008), and my wife's university statistics textbook (Hays 1988); none of them listed Simpson's Paradox. I'm afraid that for me or any linguist to learn more about statistics means reading books written by nonlinguists, taking courses from nonlinguists, and/or consulting with nonlinguists.
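
To make the puzzle concrete, here is a minimal sketch in Python of how a correlation can point one way inside each partition of a sample and the other way in the pooled sample. The numbers are invented purely for illustration and are not taken from Atkinson's data or anyone else's; the names area_a, area_b and pearson_r are hypothetical.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Invented (x, y) observations for two hypothetical areas.
# Within each area, y rises with x.
area_a = [(1, 10), (2, 11), (3, 12), (4, 13)]
area_b = [(6, 2), (7, 3), (8, 4), (9, 5)]

def split(pairs):
    """Separate a list of (x, y) pairs into a list of xs and a list of ys."""
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    return xs, ys

r_a = pearson_r(*split(area_a))                # positive within area A
r_b = pearson_r(*split(area_b))                # positive within area B
r_pooled = pearson_r(*split(area_a + area_b))  # negative once the areas are pooled

print("area A:", r_a)
print("area B:", r_b)
print("pooled:", r_pooled)
```

The reversal arises here because area B has higher x values but lower y values overall, so the difference between the areas swamps the trend inside each one. Whether anything like this holds for the phoneme inventory data is exactly what would need to be checked.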

The response by linguists to Dunn et al. and Atkinson has been uniformly negative. Many have also been arrogant, condescending and dismissive. The attitude appears to be that any work on language by nonlinguists, especially that using fancy statistics, is completely wrong. That is why I have felt obliged to defend those aspects of both papers that I think are positive, and to question some of the criticisms. This doesn't mean that I endorse their results: I don't, in the case of Dunn et al., and I am uncertain about Atkinson. But I think that the problems with Dunn et al. and with Atkinson are quite different - linguistically and statistically - and that it is worth linguists recognizing and understanding these differences.

Bill

Baayen, R. Harald. 2008. Analyzing linguistic data: a practical introduction to statistics using R. Cambridge: Cambridge University Press.
Hays, William L. 1988. Statistics (4th ed.). New York: Holt, Rinehart and Winston.
Woods, Anthony, Paul Fletcher & Arthur Hughes. 1986. Statistics in language studies. Cambridge: Cambridge University Press.
