[Lingtyp] Greenbergian word order universals: confirmed after all

Mon Nov 6 17:37:15 UTC 2023

> On 6. Nov 2023, at 12:44, Simon Greenhill <simon.greenhill at auckland.ac.nz> wrote:

> 1. Practical: Stratified sampling does not remove the auto-correlation caused by inheritance and contact. 

The standard reference for this problem in the linguistic typology literature is Perkins (1989): the conclusion of that paper is to use a sample of at most 50-100 languages “to balance the requirements for representativeness and independence in samples” (p. 312).

> 2. Theoretical: What *exactly* do you stratify on? 

Perkins (1989) already suggests in the conclusion that it is interesting to compare variation within groups of related languages to variation between groups, but no concrete method is proposed (pp. 312-313).

Using fine-grained phylogenies is a really great approach, but from a practical point of view in linguistic typology it is often difficult to obtain detailed data for enough closely related languages. As mentioned already in this discussion, Maslova (2002, 2008) has proposed a very simple method that just uses pairs of closely related languages. This method is really rough, but it seems to be useful as a first approximation.

Dediu & Cysouw (2013) compared her approach with other methods, and Maslova’s approach performed rather nicely. In the appendix (S1) to that paper there is a short summary of Maslova’s (astonishingly simple) computations, which are somewhat difficult to extract from her manuscripts (she explained them to me personally long ago when we met in Leipzig).

A first attempt to use Maslova's method on WALS data can be found in Cysouw (2007). The basic conclusion of that paper was that there are indeed situations in which a ’traditional’ sampling approach shows influence from historical coincidences. For example, the apparent (positive!) correlation between tone and vowel inventory size (Maddieson 2013) seems to be spurious. However, throughout all the WALS data, the predicted stable distributions seem to correlate strongly with the actual sampled distributions, indicating that the traditional approach to sampling is not completely wrong.
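
To make the notion of a "stable distribution" a bit more concrete, here is a minimal sketch in Python. It is not Maslova's actual computation (for that, see the appendix S1 mentioned above), only the textbook stationary distribution of a two-state Markov chain. The two transition probabilities and the starting share are hypothetical values chosen for illustration, and the function names are my own.

```python
def stationary_two_state(p_ab: float, p_ba: float) -> tuple[float, float]:
    """Stationary distribution (share of A, share of B) of a two-state Markov chain.

    p_ab: probability per time step that a type-A language shifts to type B.
    p_ba: probability per time step that a type-B language shifts to type A.
    """
    total = p_ab + p_ba
    if total <= 0:
        raise ValueError("at least one transition probability must be positive")
    return p_ba / total, p_ab / total


def step(share_a: float, p_ab: float, p_ba: float) -> float:
    """Share of type-A languages after one time step."""
    return share_a * (1 - p_ab) + (1 - share_a) * p_ba


# Hypothetical transition probabilities, not estimated from any real data.
p_ab, p_ba = 0.02, 0.06
stable_a, stable_b = stationary_two_state(p_ab, p_ba)

# Start far away from the stable distribution and iterate the process.
share_a = 0.10
for _ in range(500):
    share_a = step(share_a, p_ab, p_ba)

print(f"stable share of A: {stable_a:.3f}, after iteration: {share_a:.3f}")
```

With these assumed values the stable share of type A is 0.75, and the iterated share converges to the same number regardless of the starting point. Comparing such a predicted stable share with the share actually observed in a sample is the kind of check referred to above.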

> 3. Methodological: Phylogenetic methods get you closer to causality.

However, as Simon also argued: the great promise of the phylogenetic sampling approach (including Maslova’s simple approximation) is to be able to induce tendencies of change from static data. That seems to be the right approach to me (cf. my opinion in Cysouw 2011).

best
Michael

=======

References

- Cysouw, Michael. 28 September 2007. Investigating Transition Probabilities in the World Atlas of Language Structures. Paris, France (Presentation at ALT VII). http://www.cysouw.de/home/presentations_files/cysouwALT7TRANSITION_slides.pdf
- Cysouw, Michael. 2011. Understanding transition probabilities. Linguistic Typology 15(2). 415-431. https://doi.org/10.1515/lity.2011.028
- Dediu, Dan & Michael Cysouw. 2013. Some Structural Aspects of Language are More Stable than Others: A Comparison of Seven Methods. PLoS One 8(1). https://doi.org/10.1371/journal.pone.0055009
- Maddieson, Ian. 2013. Tone. In: Dryer, Matthew S. & Haspelmath, Martin (eds.) WALS Online. http://wals.info/chapter/13
- Maslova, Elena. 2002. Distributional universals and the rate of type shifts: towards a dynamic approach to “probability sampling”. Lecture given at the 3rd Winter Typological School, Moscow. http://anothersumma.net/Publications/Sampling.pdf
- Maslova, Elena & Tatiana Nikitina. 2008. Stochastic universals and dynamics of cross-linguistic distributions: the case of alignment. http://www.anothersumma.net/Publications/Ergativity.pdf
- Perkins, Revere D. 1989. Statistical techniques for determining language sample size. Studies in Language 13(2). 293–315.

————————
Prof. Dr. Michael Cysouw
Forschungszentrum Deutscher Sprachatlas
Philipps Universität Marburg
Pilgrimstein 16
D-35032 Marburg

Office: +49-6421-28-22488
Secretary: +49-6421-28-22483
Email: cysouw at uni-marburg.de
Web: www.deutscher-sprachatlas.de/mitarbeiter/cysouw/
Web: www.cysouw.de/home/
ORCID: orcid.org/0000-0003-3168-4946

Standort Biegenstrasse, Gebäude B|05
Pilgrimstein 16, Raum 106 (+1/0060)
http://www.uni-marburg.de/kontakt/
————————

> 
> 
> Hi Juergen,
> 
> The key problem we need to solve here is that languages are not statistically independent. They share similarity due to inheritance or contact. This is what statisticians call autocorrelation in general, and phylogenetic and spatial autocorrelation more specifically. 
> 
> There are many ways to deal with this in the literature, and we often can't separate them neatly into stratified vs. phylogenetic methods as they overlap. However, I'm a phylogenetics hard-liner, I guess, so let me give you that hardline response: 
> 
> I see three major reasons for preferring phylogenetic methods over stratified sampling. One is practical, one is theoretical, one is methodological.
> 
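> 1. Practical: Stratified sampling does not remove the auto-correlation caused by inheritance and contact.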
> 
> Look at the anthropological literature. They spent a long time trying to come up with a "standard cross-cultural sample" to avoid the problem of autocorrelation. They eventually came up with a rule of selecting societies more than '200 miles and 200 years' apart. It failed -- this sample still shows strong autocorrelation:
> 
> https://www.researchgate.net/publication/242148557_Does_Mr_Galton_Still_Have_a_Problem_Autocorrelation_in_the_Standard_Cross-Cultural_Sample
> 
> ...and I'm sure that linguistics has this problem worse than anthropology because linguistic data seems to change more slowly than cultural data. 
> 
> What this means is that any statistical test that assumes the data points are independent will give incorrect results. If you want to get the right answer you need to deal with this. Pretending it's solved by stratification and crossing your fingers is not a solution.
> 
> 
> 2. Theoretical: What *exactly* do you stratify on? 
> 
> To effectively create a robust stratified sample you need to identify the statistically independent units. One option is language families: Select (say) one per family. But then what does this mean when you take one language from a family like Austronesian with ~1300 languages and one from a family like Eastern Trans-Fly with 4 languages? This means that you've sampled about 0.08% of Austronesian but 25% of ETF. This feels wrong.
> 
> Ok, why not select one per genus? But then you need to decide what a genus is: are the Central Pacific languages the *same* as the Romance languages the *same* as Bodic languages? Not really, no. Maybe you could implement a rule like "every group larger than 30 languages" or "everything about as old as Romance", but this is hard to do, and it's just smuggling a phylogeny through the back door, so you may as well just do it properly.
> 
> Another option is to sample based on distance -- let's choose one language every x kilometers. But then languages are different sizes, so this means we might sample a geographically wide-spread language like English many times.
> 
> Do we sample one language per country? No, because some countries have many languages while others have only a few. And languages, like species, follow Rapoport's rule such that there are more languages near the equator, and languages at higher latitudes cover larger areas. This means that New Guinea has something like 900 languages while Turkey -- about the same size as the island of New Guinea -- has only 40 or so languages. So this would mean that taking one language from New Guinea and one from Turkey is sampling 0.1% vs. 2.5%. More than an order of magnitude difference.
> 
> Maybe you could do some equal-area map projection and sample accordingly, but then you have problems that some language groups have spread very far: if you sample from Taiwan and Hawaii, you'll probably get two Austronesian languages, or if you pick any two languages south of the Sahara you'll probably get two Bantu languages.
> 
> Maybe you could combine all these somehow... but then you're running a massively complex model on a small amount of data, and your statistical power has just dropped down through the basement, so you're wasting your time.
> 
> 
> 
> 3. Methodological: Phylogenetic methods get you closer to causality.
> 
> I think that a major argument for preferring phylogenetic methods over stratified sampling is that you can do more things. Looking at the methods used in Dunn et al., we can see that they infer correlation, but also infer temporal stability (time in state), and transition rates (rates at which changes occur). 
> 
> These tools therefore get you much closer to understanding the causal processes that underlie the pattern you're looking at. Often, all a stratified sample can say is "x and y are correlated", while phylogenetic methods can give information about whether "x gives rise to y or y gives rise to x". This leads to a deeper understanding of what's going on. Perhaps you could get this by combining stratified sampling with some causal graph approach but you still have the other problems I mentioned.
> 
> Maybe it's cheaper and easier to use a stratified sample. But cheaper and easier is not always better. Sometimes there's no alternative to doing things properly.
> 
> 
> --Simon
> 
> _______________________________________________
> Lingtyp mailing list
> Lingtyp at listserv.linguistlist.org
> https://listserv.linguistlist.org/cgi-bin/mailman/listinfo/lingtyp