[Lingtyp] Greenbergian word order universals: confirmed after all

Johann-Mattis List mattis.list at lingulist.de
Fri Nov 3 08:07:46 UTC 2023


Dear Randy,

Since you mention the study by Zhang et al. and the reliability of the 
data, and since this is quite important here, I think it is worthwhile 
to point to the discrepancy between the data as it exists in an 
etymological dictionary like STEDT and the data that is actually fed to 
the computers.

Here, the study by Zhang et al. shows several serious problems that we 
looked into at some point but had no time to follow up on. The most 
important one is that STEDT was not made for these analyses: there is 
in fact no real 100-item Swadesh list, and the data is not in a state 
where you could check individual roots and how they evolved along the 
tree. Everything is lost in the numbers. Nobody knows whether the 
coding was flawed or done well, and we will never know, because of the 
problematic coding procedure followed in the study, which was 
overlooked by all reviewers.

So we can observe another, additional problem with using phylogenetic 
or typological databases: experts in one domain (etymology), like you, 
Randy, often do not have the opportunity or the expertise in the other 
domain (data coding), so a lot can be lost when one trusts a database 
without checking how the data was turned into numbers.

For the future, we need more people with expertise in both domains, as 
in evolutionary biology: people who understand the major processes of 
language change while at the same time understanding how the original 
data is turned into numerical representations that can then be tested 
computationally.
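To make the recoding step concrete: a common way such studies turn cognate judgements into numbers is to recode each cognate set as a binary presence/absence character per language. The sketch below uses invented toy data (the languages "A", "B", "C" and the cognate-set assignments are hypothetical, not taken from STEDT or from Zhang et al.), but it illustrates the kind of transformation where information about individual roots gets lost.

```python
# Toy cognate codings: concept -> {language: cognate set id}.
# All assignments here are invented for illustration only.
cognates = {
    "water": {"A": 1, "B": 1, "C": 2},
    "stone": {"A": 1, "B": 2, "C": 2},
}

languages = sorted({lg for sets in cognates.values() for lg in sets})

# One binary character per (concept, cognate set) pair:
# 1 if the language's word for the concept belongs to that set, else 0.
characters = []
matrix = {lg: [] for lg in languages}
for concept, assignment in sorted(cognates.items()):
    for set_id in sorted(set(assignment.values())):
        characters.append(f"{concept}-{set_id}")
        for lg in languages:
            matrix[lg].append(1 if assignment.get(lg) == set_id else 0)

for lg in languages:
    print(lg, matrix[lg])
```

Once the data is in this form, the phylogenetic software sees only rows of 0s and 1s; whether a given 1 rests on a solid etymology or a coding error is no longer visible, which is exactly why the coding procedure needs to be checkable.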

Best,

Mattis

On 03.11.23 at 07:10, Randy LaPolla wrote:
> Hi Martin and all,
> Over the years I have been asked by Nature to review a number of these 
> papers that use a Bayesian-based algorithm (usually the exact same 
> one); there has been a fad of such papers, and my response is almost 
> always the same: they use a method (lexicostatistics) long ago 
> discredited in linguistics, yet sometimes come up with results quite 
> similar to those found by more empirical traditional studies. As 
> their valid results are never new, the only thing worth mentioning is 
> the methodology, as Jürgen pointed out. The methodology sometimes 
> fails, though, and there are two crucial reasons why: the only things 
> that vary among all these studies are which database they use and how 
> they set the priors, both of which can greatly bias the outcome. The 
> one such study I supported was by Zhang Menghan et al. in 2019, as it 
> used a very reliable database (Matisoff's Sino-Tibetan Etymological 
> Dictionary and Thesaurus, developed over 30 years) and did not set any 
> priors that would have biased the outcome. Most of the others use 
> problematic datasets, and as the old saying goes: garbage in, garbage out.
> 
> Randy
> 
>> On Nov 2, 2023, at 22:22, Martin Haspelmath 
>> <martin_haspelmath at eva.mpg.de> wrote:
>>
>> Dear all,
>>
>> Twelve years ago, for the first (and so far last) time, typology made 
>> it into /Nature/, and /BBC Online/ reported at the time: “A 
>> long-standing idea that human languages share universal features that 
>> are dictated by human brain structure has been cast into doubt.” 
>> (https://www.bbc.com/news/science-environment-13049700). Our journal 
>> /Linguistic Typology/ took this as an opportunity to publish a 
>> “Universals Debate” taking up 200 pages 
>> (https://www.degruyter.com/document/doi/10.1515/lity.2011.023/html). 
>> Younger LINGTYP readers may not remember all this, but a lot of stir 
>> was caused at the time by the paper by Dunn et al. (2011), which 
>> claimed that "systematic linkages of traits are likely to be the rare 
>> exception rather than the rule. Linguistic diversity does not seem to 
>> be tightly constrained by universal cognitive factors" 
>> (https://www.nature.com/articles/nature09923). Their paper argued not 
>> only against Chomskyan UG (universal grammar), but also against the 
>> Greenbergian word order universals (Dryer 1992).
>>
>> In the meantime, however, it has become clear that those surprising 
>> claims about word order universals are not supported – the sample size 
>> (four language families) used in their paper was much too small.
>>
>> Much less prominently, Jäger & Wahle (2021) reexamined those claims 
>> (using similar methods, but many more language families and all 
>> relevant /WALS/ data), finding “statistical evidence for 13 word order 
>> features, which largely confirm the findings of traditional 
>> typological research” 
>> (https://www.frontiersin.org/articles/10.3389/fpsyg.2021.682132/full).
>>
>> Similarly, Annemarie Verkerk and colleagues (including Russell Gray) 
>> have recently reexamined a substantial number of claimed universals on 
>> the basis of the much larger Grambank database and found that 
>> especially Greenberg’s word order universals hold up quite well (see 
>> Verkerk’s talk at the recent Grambank workshop at MPI-EVA: 
>> https://www.eva.mpg.de/de/linguistic-and-cultural-evolution/events/2023-grambank-workshop/, available on YouTube: https://www.youtube.com/playlist?list=PLSqqgRcaL9yl8FNW_wb8tDIzz9R78t8Uk).
>>
>> So what went wrong in 2011? We are used to paying a lot of attention 
>> to the “big journals” (/Nature, Science, PNAS, Cell/), but they often 
>> focus on sensationalist claims, and they typically publish less 
>> reliable results than average journals (see Brembs 2018: 
>> https://www.frontiersin.org/articles/10.3389/fnhum.2018.00037/full).
>>
>> So maybe we should be extra skeptical when a paper is published in a 
>> high-prestige journal. But another question that I have is: Why didn’t 
>> the authors see that their 2011 results were unlikely to be true, and 
>> that their sample size was much too small? Why didn't they notice that 
>> most of the word order changes in their four families were 
>> contact-induced? Were they so convinced that their new mathematical 
>> method (adopted from computational biology) would yield correct 
>> results that they neglected to pay sufficient attention to the data? 
>> Would it have helped if they had submitted their paper to a 
>> linguistics journal?
>>
>> Perhaps I’m too pessimistic (see also this blogpost: 
>> https://dlc.hypotheses.org/2368), but in any event, I think that this 
>> intriguing episode (and sobering experience) should be discussed among 
>> typologists, and we should learn from it, in one way or another. 
>> Advanced quantitative methods are now everywhere in science, and it 
>> seems that they are often misapplied or misunderstood (see also this 
>> recent blogpost by Richard McElreath: 
>> https://elevanth.org/blog/2023/06/13/science-and-the-dumpster-fire/).
>>
>> Martin
>>
>> -- 
>> Martin Haspelmath
>> Max Planck Institute for Evolutionary Anthropology
>> Deutscher Platz 6
>> D-04103 Leipzig
>> https://www.eva.mpg.de/linguistic-and-cultural-evolution/staff/martin-haspelmath/
>> _______________________________________________
>> Lingtyp mailing list
>> Lingtyp at listserv.linguistlist.org
>> https://listserv.linguistlist.org/cgi-bin/mailman/listinfo/lingtyp

