[Lingtyp] Greenbergian word order universals: confirmed after all

Mon Nov 6 11:44:37 UTC 2023

Hi Juergen,

The key problem we need to solve here is that languages are not statistically independent. They share similarity due to inheritance or contact. This is what statisticians call autocorrelation in general, and phylogenetic and spatial autocorrelation more specifically. 

There are many ways to deal with this in the literature, and we often can't separate them neatly into stratified vs. phylogenetic methods as they overlap. However, I'm a phylogenetics hard-liner, I guess, so let me give you that hardline response: 

I see three major reasons for preferring phylogenetic methods over stratified sampling. One is practical, one is theoretical, one is methodological.

1. Practical: Stratified sampling does not remove the autocorrelation caused by inheritance and contact.

Look at the anthropological literature. They spent a long time trying to come up with a "standard cross-cultural sample" to avoid the problem of autocorrelation. They eventually came up with a rule of selecting societies more than '200 miles and 200 years' apart. It failed -- this sample still shows strong autocorrelation:

https://www.researchgate.net/publication/242148557_Does_Mr_Galton_Still_Have_a_Problem_Autocorrelation_in_the_Standard_Cross-Cultural_Sample

...and I'm sure that linguistics has this problem worse than anthropology because linguistic data seem to change more slowly than cultural data.

What this means is that any statistical test that assumes the data points are independent will give incorrect results. If you want to get the right answer you need to deal with this. Pretending it's solved by stratification and crossing your fingers is not a solution.
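To make the point concrete, here is a toy simulation in Python. It is only a sketch under assumptions of my own choosing, not an analysis from the paper linked above: 20 families of 10 languages each, two binary traits that evolve completely independently of each other, and a hypothetical `fidelity` parameter giving the probability that a daughter language keeps its ancestor's value. There is no contact and no tree structure within families, so if anything it understates the real problem.

```python
import random

def simulate_traits(n_families, family_size, fidelity, rng):
    """Return (x, y) pairs. x and y never influence each other;
    they are only inherited from the same family ancestor."""
    data = []
    for _ in range(n_families):
        anc_x, anc_y = rng.random() < 0.5, rng.random() < 0.5
        for _ in range(family_size):
            x = anc_x if rng.random() < fidelity else not anc_x
            y = anc_y if rng.random() < fidelity else not anc_y
            data.append((x, y))
    return data

def chi_square_2x2(data):
    """Pearson chi-square for a 2x2 table, treating every language
    as an independent data point."""
    a = sum(1 for x, y in data if x and y)
    b = sum(1 for x, y in data if x and not y)
    c = sum(1 for x, y in data if not x and y)
    d = sum(1 for x, y in data if not x and not y)
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:  # a trait with no variation: nothing to test
        return 0.0
    return n * (a * d - b * c) ** 2 / denom

def false_positive_rate(fidelity, n_sims=500, seed=1):
    """Share of simulations in which the test reports p < 0.05
    even though the traits are unrelated by construction."""
    rng = random.Random(seed)
    critical = 3.841  # chi-square critical value, 1 df, alpha = 0.05
    hits = sum(
        chi_square_2x2(simulate_traits(20, 10, fidelity, rng)) > critical
        for _ in range(n_sims)
    )
    return hits / n_sims

rate_related = false_positive_rate(fidelity=0.9)    # strong inheritance
rate_unrelated = false_positive_rate(fidelity=0.5)  # no inheritance signal
print(f"false positives with inheritance:    {rate_related:.2f}")
print(f"false positives without inheritance: {rate_unrelated:.2f}")
```

With `fidelity=0.5` the languages really are independent and the false-positive rate should sit near the nominal 5%. With `fidelity=0.9` it should come out far higher, even though neither trait has any effect on the other.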

2. Theoretical: What *exactly* do you stratify on? 

To create a robust stratified sample you need to identify the statistically independent units. One option is language families: select (say) one per family. But then what does it mean when you take one language from a family like Austronesian with ~1300 languages and one from a family like Eastern Trans-Fly with 4 languages? It means that you've sampled about 0.08% of Austronesian but 25% of ETF. This feels wrong.

Ok, why not select one per genus? But then you need to decide what a genus is: are the Central Pacific languages the *same* kind of unit as the Romance languages, or as the Bodic languages? Not really, no. Maybe you could implement a rule like "every group larger than 30 languages" or "everything about as old as Romance", but this is hard to do, and it's just smuggling a phylogeny through the back door, so you may as well just do it properly.

Another option is to sample based on distance -- let's choose one language every x kilometers. But then languages cover areas of very different sizes, so this means we might sample a geographically widespread language like English many times.

Do we sample one language per country? No, because some countries have many languages while others have only a few. And languages, like species, follow Rapoport's rule such that there are more languages near the equator, and languages at higher latitudes cover larger areas. This means that New Guinea has something like 900 languages while Turkey -- about the same size as the island of New Guinea -- has only 40 or so languages. So taking one language from New Guinea and one from Turkey is sampling about 0.1% vs. 2.5%. That's more than an order of magnitude difference.
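For anyone who wants to check the arithmetic, the sampling fractions in this section take a few lines of Python. The language counts are the rough figures quoted above, not authoritative numbers, and `sampling_percent` is just a helper name for this sketch.

```python
# Rough language counts as quoted in this message (approximate).
groups = {
    "Austronesian": 1300,
    "Eastern Trans-Fly": 4,
    "New Guinea": 900,
    "Turkey": 40,
}

def sampling_percent(n_languages, n_sampled=1):
    """Percentage of a group covered when n_sampled languages are drawn."""
    return 100.0 * n_sampled / n_languages

for name, count in groups.items():
    print(f"{name:18s} 1 of {count:4d} = {sampling_percent(count):.2f}%")

# How unevenly are the groups covered when each contributes one language?
family_ratio = sampling_percent(4) / sampling_percent(1300)
area_ratio = sampling_percent(40) / sampling_percent(900)
print(f"ETF vs. Austronesian:  {family_ratio:.0f}x")
print(f"Turkey vs. New Guinea: {area_ratio:.1f}x")
```

Taking one language per unit means the smaller unit is covered several hundred times more densely in the family case, and a bit over twenty times more densely in the area case.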

Maybe you could do some equal-area map projection and sample accordingly, but then you have problems that some language groups have spread very far: if you sample from Taiwan and Hawaii, you'll probably get two Austronesian languages, or if you pick any two languages south of the Sahara you'll probably get two Bantu languages.

Maybe you could combine all these somehow... but then you're running a massively complex model on a small amount of data. Your statistical power has just dropped down through the basement, so you're wasting your time.

3. Methodological: Phylogenetic methods get you closer to causality.

I think that a major argument for preferring phylogenetic methods over stratified sampling is that you can do more things. Looking at the methods used in Dunn et al., we can see that they infer correlation, but also temporal stability (time in state) and transition rates (rates at which changes occur).

These tools therefore get you much closer to understanding the causal processes that underlie the pattern you're looking at. Often, all a stratified sample can say is "x and y are correlated", while phylogenetic methods can give information about whether "x gives rise to y or y gives rise to x". This leads to a deeper understanding of what's going on. Perhaps you could get this by combining stratified sampling with some causal graph approach, but you still have the other problems I mentioned.

Maybe it's cheaper and easier to use a stratified sample. But cheaper and easier is not always better. Sometimes there's no alternative to doing things properly.

--Simon