[Lingtyp] Greenbergian word order universals: confirmed after all

Sat Dec 23 14:59:51 UTC 2023

On Mon, 6 Nov 2023 at 12:44, Simon Greenhill <
simon.greenhill at auckland.ac.nz> wrote:

> Hi Juergen,
>
> The key problem we need to solve here is that languages are not
> statistically independent. They share similarity due to inheritance or
> contact. This is what statisticians call autocorrelation in general, and
> phylogenetic and spatial autocorrelation more specifically.
>
> There are many ways to deal with this in the literature, and we often
> can't separate them neatly into stratified vs. phylogenetic methods as they
> overlap. However, I'm a phylogenetics hard-liner, I guess, so let me give
> you that hardline response:
>
> I see three major reasons for preferring phylogenetic methods over
> stratified sampling. One is practical, one is theoretical, one is
> methodological.
>
> 1. Practical: Stratified sampling does not remove the auto-correlation
> caused by inheritance and contact.
>
> Look at the anthropological literature. They spent a long time trying to
> come up with a "standard cross-cultural sample" to avoid the problem of
> autocorrelation. They eventually came up with a rule of selecting societies
> more than '200 miles and 200 years' apart. It failed -- this sample still
> shows strong autocorrelation:
>
>
> https://www.researchgate.net/publication/242148557_Does_Mr_Galton_Still_Have_a_Problem_Autocorrelation_in_the_Standard_Cross-Cultural_Sample
>
> ...and I'm sure that linguistics has this problem worse than anthropology
> because linguistic data seems to change slower than cultural data.
>
> What this means is that any statistical test that assumes the data points are
> independent will give incorrect results. If you want to get the right
> answer you need to deal with this. Pretending it's solved by stratification
> and crossing your fingers is not a solution.
>

(Sorry for the late response to an old thread; someone raised this issue
with me the other day, citing the thread.)

Actually typological data from a simple genealogically stratified
sample usually does not exhibit strong (spatial) autocorrelation,
although this of course depends on what one means by "strong".

For example, take one (random) language per Glottolog family, i.e.,
stratify by family, and compute Moran's I for each Grambank (GB)
feature using the spatial weights formula from the cited Eff (2004)
paper. This gives samples of 200-300 datapoints, depending on the GB
feature. Most features exhibit a rather low Moran's I (see attached
histogram). One can probably improve on this by thinning the sample
to get larger geographical distances, or by selecting geographically
distant languages from different families rather than random ones.
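
A minimal sketch of this kind of check, for readers who want to try it
on their own data. It is not the script behind the attached histogram.
The dictionary keys ('family', 'lat', 'lon', 'value') are hypothetical,
and plain inverse great-circle distance is used as an assumed stand-in
for the spatial weights, which may differ from the formula in Eff
(2004).

```python
import math
import random
from collections import defaultdict

def one_per_family(languages, rng):
    """Pick one random language per family.

    `languages` is a list of dicts with the (hypothetical) keys
    'family', 'lat', 'lon' and 'value'.
    """
    by_family = defaultdict(list)
    for lang in languages:
        by_family[lang["family"]].append(lang)
    # Sort family names so that a seeded rng gives a reproducible sample.
    return [rng.choice(by_family[fam]) for fam in sorted(by_family)]

def great_circle_km(p, q):
    """Haversine distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(min(1.0, a)))

def inverse_distance_weights(coords, row_normalize=False):
    """Assumed weights: w[i][j] = 1 / distance, with a zero diagonal."""
    n = len(coords)
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                # Floor the distance at 1 km to avoid division by zero.
                w[i][j] = 1.0 / max(great_circle_km(coords[i], coords[j]), 1.0)
        if row_normalize:
            total = sum(w[i])
            if total > 0:
                w[i] = [v / total for v in w[i]]
    return w

def morans_i(values, w):
    """Moran's I = (n / S0) * sum_ij w_ij d_i d_j / sum_i d_i^2,
    where d_i is the deviation of value i from the mean and S0 is the
    sum of all weights."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    denom = sum(d * d for d in dev)
    if denom == 0:
        raise ValueError("Moran's I is undefined for a constant variable")
    s0 = sum(sum(row) for row in w)
    num = sum(w[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n))
    return (n / s0) * num / denom
```

With real data one would call one_per_family(languages,
random.Random(seed)), build the weights from the sampled coordinates,
and run morans_i once per feature, skipping languages with missing
values for that feature.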

Eff (2004) uses an (arbitrary) threshold of 0.1 and finds that 44% of
the studied variables in the SCCS exhibit spatial autocorrelation,
which is similar to one-per-family sampling on Grambank data (49% of
the features, or slightly lower, 46%, with row-normalization of the
spatial weights). Instead of using a threshold value, it may be more
revealing to gauge significance by permutation tests. Then only 51 GB
features show significant spatial autocorrelation at the 0.05 level,
and only 21 at the 0.001 level, and that is before correcting for
multiple testing.
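
For concreteness, a permutation test for Moran's I can be sketched as
follows. This is an illustration under stated assumptions, not the
code that produced the counts above: the one-sided p-value, the
default of 999 permutations, and the toy chain_weights helper are all
choices made here for the example.

```python
import random

def morans_i(values, w):
    """Moran's I for a list of values and a square weight matrix w."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    denom = sum(d * d for d in dev)
    if denom == 0:
        raise ValueError("Moran's I is undefined for a constant variable")
    s0 = sum(sum(row) for row in w)
    num = sum(w[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n))
    return (n / s0) * num / denom

def permutation_p_value(values, w, n_perm=999, rng=None):
    """One-sided permutation test for positive spatial autocorrelation.

    The values are shuffled across locations n_perm times; the p-value
    is the share of shuffles (plus the observed arrangement itself)
    whose Moran's I is at least as large as the observed one.
    """
    rng = rng or random.Random()
    observed = morans_i(values, w)
    shuffled = list(values)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        # Small tolerance so floating-point ties count as "at least as large".
        if morans_i(shuffled, w) >= observed - 1e-12:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)

def chain_weights(n):
    """Toy weights (hypothetical helper): each point neighbours only the
    points directly before and after it on a line."""
    w = [[0.0] * n for _ in range(n)]
    for i in range(n - 1):
        w[i][i + 1] = w[i + 1][i] = 1.0
    return w

if __name__ == "__main__":
    w = chain_weights(20)
    clustered = [1] * 10 + [0] * 10   # all 1s at one end of the chain
    i_obs, p = permutation_p_value(clustered, w, rng=random.Random(1))
    print(f"Moran's I = {i_obs:.3f}, permutation p = {p:.3f}")
```

When this is run over a couple of hundred features, the resulting
p-values would still need a multiple-testing correction before the
counts of "significant" features are interpreted.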

So for many typological features one can indeed just stratify by
family and expect the remaining areal effects not to be very strong
or, better, also address the remaining areal effects. Phylogenetic
methods do not address areality at all, so they cannot be a solution
to it, and they typically suffer from a worse version of Galton's
problem anyway. In a genealogically stratified world-wide sample, the
areal effects will be *different* across the datapoints (and one might
hope they even cancel out), but in a phylogenetic study of one or a
few families it can be *one and the same* areal effect that targets
the datapoints.

all the best, H
-------------- next part --------------
A non-text attachment was scrubbed...
Name: moranigb.png
Type: image/png
Size: 17764 bytes
Desc: not available
URL: <http://listserv.linguistlist.org/pipermail/lingtyp/attachments/20231223/0131c7be/attachment.png>