I will not discuss the content of Matthew's response here, as I have
discussed it with him privately. I agree with his main point, that
Atkinson should control for large-scale areal influence. I also agree
with Ian that once one opens the door to one geographical correlation
hypothesis, one needs to consider others as well.
However, I strongly disagree with the implications of Matthew's
polemical remarks:
"But the big problem with the Atkinson paper and others like it is
precisely that nonlinguists who are experts on statistics do not
understand the peculiar nature of crosslinguistic data...Linguists
should be very wary of seeking the advice of nonlinguists regarding
statistics."
I don't know if this is what Matthew intended, but this sounds very
much as if (cross)linguistic data is somehow immune to the laws of
statistics, and hence linguists need not concern themselves with
developments in statistics (especially since such developments are
the work of nonlinguists).
Linguistic data is not "peculiar". Linguistic data, like other data
from human behavior and other complex systems, is the product of
stochastic processes influenced by a variety of factors that causally
interact. Our task is to identify the relevant factors and determine
their influence, if any. That can be done with a range of statistical
methods and models, which can, for example, account for large-scale
areal influence in the phoneme inventory data if we think they should.
Of course, identifying the relevant factors depends on the causal
models that we propose to account for the behavior. Atkinson has a
causal model, which leads him to bring in the factors that he does
(and, as his supplementary materials show, he controls for quite a
number of plausible confounding factors, though not area). The
problem is, we linguists do not believe in the causal model, so we
don't think distance from Africa, or even population size, should be
the only additional factors considered in the statistical analysis.
But linguists don't all agree on causal models of language behavior
either. (Note that Dunn et al. are a team of linguists as well as
nonlinguists.) And sometimes we have to look outside the box and
consider other possibilities, as Hay and Bauer (2007) did - even if
they turn out to be artifacts.
I think that linguists should learn more about statistics. Many of
the posts about Atkinson's paper on Language Log, at the NY Times
article, and on Funknet do not recognize some basic statistical
principles. Even the detailed and carefully reasoned posts would
benefit from more detailed knowledge of statistics, I believe. I say
that for myself as well, of course. For instance, I have been told
(via a psycholinguist) that the puzzle I discussed, the possibility
of different correlations in a sample and in partitions of the
sample, has a name in statistics: Simpson's Paradox. I checked the
indexes of the two statistics textbooks I have by linguists (Woods et
al. 1986 and Baayen 2008), and my wife's university statistics
textbook (Hays 1988); none of them listed Simpson's Paradox. I'm
afraid that for me or any linguist to learn more about statistics
means reading books written by nonlinguists, taking courses from
nonlinguists, and/or consulting with nonlinguists.
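To make the puzzle concrete, here is a minimal sketch in Python of how a correlation can have one sign in the whole sample and the opposite sign in every partition of it. The numbers are invented purely for illustration; the two "areas" and the variables x and y are hypothetical and do not come from Atkinson's data or any other real dataset.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Invented values for two hypothetical areas; NOT real linguistic data.
# Within each area, y falls as x rises, but area B sits higher on both axes.
area_a = {"x": [1, 2, 3], "y": [3, 2, 1]}
area_b = {"x": [6, 7, 8], "y": [8, 7, 6]}

# Correlation inside each partition of the sample.
r_a = pearson(area_a["x"], area_a["y"])
r_b = pearson(area_b["x"], area_b["y"])

# Correlation in the pooled sample, ignoring area.
r_pooled = pearson(area_a["x"] + area_b["x"], area_a["y"] + area_b["y"])

print(f"within area A: {r_a:+.2f}")
print(f"within area B: {r_b:+.2f}")
print(f"pooled sample: {r_pooled:+.2f}")
```

In this toy sample the correlation is negative within each area but positive once the areas are pooled, because the difference between the areas swamps the trend inside them. Which of the two figures is the meaningful one depends on the causal model one has in mind, not on the arithmetic.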
The response by linguists to Dunn et al. and Atkinson has been
uniformly negative. Many responses have also been arrogant,
condescending and dismissive. The attitude appears to be that any
work on language by nonlinguists, especially work using fancy
statistics, is completely wrong. That is why I have felt obliged to
defend those aspects of
both papers that I think are positive, and to question some of the
criticisms. This doesn't mean that I endorse their results: I don't,
in the case of Dunn et al., and I am uncertain about Atkinson. But I
think that the problems with Dunn et al. and with Atkinson are quite
different - linguistically and statistically - and that it is worth
linguists recognizing and understanding these differences.
