[Lingtyp] Reporting cross-linguistic frequencies

Matías Guzmán Naranjo mguzmann89 at gmail.com
Fri Nov 21 12:37:33 UTC 2025


Dear Peter,

I'm a bit biased because most of my work centers around developing new 
statistical methods, but I am very sympathetic to the skepticism that 
you raise, especially given the lack of consensus across researchers, as 
well as other technical and non-technical issues. Experimental 
techniques can get around several drawbacks of statistical modeling of 
large-scale typological samples, and they also provide different types 
of insight, but they have their own set of drawbacks and problems. 
Replacing statistical modeling of typological samples with experiments 
on a small sample of languages does not get rid of uncertainty; it just 
changes what we're uncertain about. If the aim is to find 'universal 
tendencies', I am not sure that the answer should be to completely avoid 
a set of methods. Rather, we should hope for converging results across 
different approaches while trying to understand the weaknesses of each.

 > the whole enterprise does not appear to be very productive

I don't disagree with this sentiment, but I wonder whether this is an 
inherent property of the enterprise or a result of us still trying to 
figure out what works, what doesn't, and how we should interpret our 
results. To me, it feels a bit early to fully give up on it.

Matías

On 21/11/25 at 11:59, Peter Arkadiev via Lingtyp wrote:
> Dear Martin, dear all,
> I am starting to wonder whether statistical analysis of a language 
> sample is at all a suitable method for "detecting universal tendencies 
> that are caused by universal/non-historical factors" (Martin's 
> formulation). Given that there is no consensus as to how to overcome 
> genealogical and areal biases and even whether those biases must be 
> overcome at all and what trying to overcome them actually gets us 
> (apart from getting some of us high-profile publications with ever 
> more complicated mathematical apparatus which others among us struggle 
> to understand and cannot evaluate; not being in any way a 
> "mathematically-gifted person", to borrow Stela's expression, I belong 
> to the latter group), the whole enterprise does not appear to be very 
> productive. What if the more appropriate method, at least where purported 
> functional factors are concerned, is the one employed by John 
> Hawkins, Natalia Levshina and some others, i.e. to combine 
> experimental research on production / processing with a quantitative 
> study of variation in corpora across a small number of sufficiently 
> distinct languages? If we can show that certain well-defined factors 
> are operative in language processing and result in skewed 
> distributions in corpora ultimately translatable into tendencies of 
> diachronic change, and moreover are able to corroborate these results 
> by similarly skewed distributions of variables in reasonably designed 
> cross-linguistic samples, then what else do we need? In any case, as 
> has been stated repeatedly, even if we find that in a 
> certain language sample, however well-designed, a certain variable 
> shows a clearly skewed distribution of, say, 80% vs. 20%, nothing 
> follows from this in terms of "universal preferences" unless we are 
> able to independently show that the more frequent value is in some 
> way or other "preferred" in processing / production etc. I am sorry if 
> the above is self-evident or naive.
> Best regards,
> Peter
> ----------------
> To: lingtyp at listserv.linguistlist.org 
> (lingtyp at listserv.linguistlist.org);
> Subject: [Lingtyp] Reporting cross-linguistic frequencies;
> 21.11.2025, 10:19, "Martin Haspelmath via Lingtyp" 
> <lingtyp at listserv.linguistlist.org>:
>
>     Thanks, Jürgen! I like the "wave vs. particle" analogy, because
>     these concrete expressions help us make sense of what seems to be
>     going on (given the experimental results).
>
>     In worldwide comparative linguistics, we also want to make sense
>     of what is going on, but it seems to me that we need analogies not
>     only for interpreting results, but also for understanding what we
>     are aiming for. For me, "removing areal and
>     genealogical/phylogenetic bias" has the aim of detecting universal
>     tendencies that are caused by universal/non-historical factors.
>
>     I would think that in the imagined concrete scenario of a sample
>     of isolated isolates (e.g. 100 languages that have long existed on
>     isolated islands, maybe of the Rapanui type), looking at these 100
>     isolates should give the same results as looking at 100 sample
>     languages from larger families that have been shaped also by contact.
>
>     Are there reasons to doubt this? If not, then we can take the
>     "isolated isolates" scenario simply as a way of illustrating our
>     goals in concrete terms (somewhat like "wave" and "particle" serve
>     as concrete illustrations).
>
>     But maybe the imagined scenario (which is not an "assumption"!!)
>     is somehow problematic, because the goals of our enterprise are
>     DIFFERENT. In Bickel's (2007) paper (LiTy 11), which has been
>     widely cited, the idea seems to be that looking for "history-free"
>     tendencies is somehow an obsolete goal.
>
>     Some people have suggested that in identifying universal trends,
>     one MUST take into account genealogies, and isolates are
>     problematic because they are not part of any genealogy. This is
>     because we should not look primarily at languages, but at
>     *transitions* (changes from one type to another). If I understood
>     Verkerk et al. (2025) correctly, they solved the "isolates
>     problem" by using an artificial world tree (where all languages
>     are somehow included; the very beautiful tree is used in the press
>     release
>     <https://www.mpg.de/25723124/1114-evan-enduring-patterns-in-the-world-s-languages-150495-x>).
>     Are Verkerk et al. pursuing a different goal? That is not really
>     clear to me.
>
>     I find the notion of an artificial world tree profoundly strange,
>     much stranger than the hypothetical scenario of 100 isolates on
>     remote islands. But maybe it is needed, because the goal of the
>     enterprise is somehow different (along Bickel's lines)? So I like
>     the imagined "isolated isolates" scenario also because it
>     clarifies what I'm interested in.
>
>     (And isn't Trudgill's idea that isolates are somehow "exotic" very
>     speculative? Shcherbakova et al. 2023 have not provided strong
>     evidence against the idea, but they simply did not find evidence
>     in favour of it.)
>
>     One last point: Yes, all isolates are survivors from some larger
>     family, but why is that relevant? Languages may have existed for
>     half a million years or longer, and we know almost nothing about
>     that deep past. Most of the currently existing families probably
>     had more branches in earlier times, and the protolanguages we
>     reconstruct may or may not have been isolates themselves. We
>     cannot tell, but I don't see why we would need to know.
>
>     Best,
>
>     Martin
>
>     On 21.11.25 07:07, Juergen Bohnemeyer via Lingtyp wrote:
>
>         Dear all — Here’s a quick explanation of why the assumption of
>         an “isolated isolate” is profoundly strange:
>
>         Leaving aside sign languages, constructed languages, and
>         artificial languages, nobody seems to entertain the
>         possibility that languages have emerged spontaneously out of
>         something that we wouldn’t consider a language itself over the
>         last few thousand years. In other words, the languages we
>         consider isolates are without exception lone survivors; but
>         they did descend from ancestors which are often lost and
>         unknown, and these ancestors biased the offshoot's properties
>         by dint of inheritance/transmission.
>
>         The reason isolates are interesting from a sampling
>         perspective is that they may represent entire genera or
>         families without forcing us to pick a member. But being an
>         isolate does not mean being free of phylogenetic bias. On the
>         contrary: isolates of unknown descent are actually
>         particularly problematic in the sense that they are shaped by
>         biases that we have no way of identifying directly since the
>         biasing ancestors have been lost to time.
>
>         As to contact. Languages that are not in contact with other
>         languages over long stretches of time are extremely rare and
>         unusual. In fact, as I’m sure everyone here is aware, such
>         languages have been plausibly argued to tend to evolve exotic
>         properties as a result of their isolation (Lupyan & Dale 2010;
>         Trudgill 2011), although this is controversial (Shcherbakova
>         et al. 2023). In any case, I would certainly not want to make
>         such languages the basis for causal inference in typology.
>
>         But it gets a lot worse. The “isolated isolate” interpretation
>         doesn’t just require us to think of a language that isn’t
>         currently in contact with any other language. We would have to
>         assume a language that has *never* come into contact with any
>         other language at any point in its history (at least not
>         long/intensively enough to change as a result of it). I’m
>         seriously uncertain whether such a language has ever existed
>         on this planet.
>
>         Here’s an analogy from quantum mechanics: Schrödinger’s and
>         Heisenberg’s equations are mathematical models that describe
>         the experimentally observed behavior of elementary particles
>         under various conditions. The particle and the wave
>         interpretation are interpretations that we use to make sense
>         of these mathematical models. We find these models useful
>         because most of us don’t think in mathematical equations (not
>         even theoretical physicists, it would seem). But if we push
>         these interpretations beyond a certain point, they break down.
>         To begin with, we can’t think of something simultaneously as a
>         wave and as a particle.
>
>         In the same way, we can mathematically describe the influence
>         phylogeny and areality exert on the probability of a
>         particular language having certain properties. The “isolated
>         isolate” interpretation is just that - an interpretation of
>         the statistical models; but, as I tried to show above, it runs
>         into absurdities rather more quickly than the particle and
>         wave interpretations in quantum mechanics.
>
>         Best — Juergen
>
>         G. Lupyan, R. Dale, Language structure is partly determined by
>         social structure. /PLOS ONE/ 5, e8559 (2010).
>
>         O. Shcherbakova, S. M. Michaelis, H. J. Haynie, et al.,
>         Societies of strangers do not speak less complex languages.
>         /Science Advances/ 9, eadf7704 (2023).
>
>         P. Trudgill, /Sociolinguistic Typology: Social Determinants of
>         Linguistic Complexity/ (Oxford Univ. Press, 2011).
>
>         Juergen Bohnemeyer (He/Him)
>         Professor, Department of Linguistics
>         University at Buffalo
>
>         -- 
>
>         *From:* Lingtyp <lingtyp-bounces at listserv.linguistlist.org>
>         on behalf of Matías Guzmán Naranjo via Lingtyp
>         <lingtyp at listserv.linguistlist.org>
>         *Date:* Thursday, November 20, 2025 at 04:01
>         *To:* lingtyp at listserv.linguistlist.org
>         *Subject:* Re: [Lingtyp] Reporting cross-linguistic frequencies
>
>         I'll jump in with some thoughts.
>
>
>         - Dryer's method and ours aim at doing basically the same
>         thing: quantifying what's "left" after removing genetic and
>         areal bias.
>
>         - Whether you should call them proportions or adjusted
>         frequencies... I'm not sure that it matters that much? As long
>         as you understand how they were calculated...
>
>         - How you want to interpret this "what's left" is debatable, I
>         guess, but I don't think I agree with Jürgen. As far as I can
>         tell it should be compatible with something along the lines of
>         an "isolated isolate" as described by Martin. You can also see
>         them as 'universal' preferences (more or less the same thing?).
>
>         - "the probability of a random language having a certain
>         property depends on (or is influenced by, or varies with, etc.)
>         it being related to certain other languages, or being spoken
>         (or signed) in a particular area" -> In our approach we assume
>         that the probability of a language L having some feature value
>         F depends on three things: 1) its relatedness to other
>         languages, 2) its contact with other languages, 3) some
>         universal preference for F. Kind of the point of what we do is
>         that we try to estimate each of these factors. [We can add more
>         factors and more structure, but that's the most basic model]
>
>         - You can quantify the contribution of the phylogenetic
>         component and the areal component(s) with our techniques, but
>         this is a bit tricky because there is unavoidable overlap in
>         the information each one contains. These measures also have a
>         different meaning than the adjusted frequency and can't be used
>         as a replacement for it; you can use them in addition to it.
>
>
>         Matías
>
>
>
>         On 20/11/25 at 9:36, Omri Amiraz via Lingtyp wrote:
>         > Dear all,
>         > I agree with Ian that, in addition to genealogical and areal
>         > biases, the very question of what counts as a language versus
>         > a dialect is partly subjective. This makes actual frequencies
>         > even more problematic, since we would obtain different results
>         > depending on whether we treat Wu Chinese as one language or as
>         > thirty separate languages, as Ian pointed out.
>         >
>         > Juergen wrote: "We can empirically assess the extent to which
>         > the probability of a random language having a certain property
>         > depends on (or is influenced by, or varies with, etc.) it being
>         > related to certain other languages, or being spoken (or signed)
>         > in a particular area."
>         >
>         > I wonder whether it might be useful to have a measure of the
>         > genealogical and areal spread of a feature, essentially
>         > quantifying how broadly it is distributed across families and
>         > regions in the present-day world. Such a measure might be more
>         > straightforward to interpret than an adjusted
>         > frequency/probability, since it is not clear whether the
>         > described population is a hypothetical set of isolated isolates
>         > or something else.
>         >
>         > Is anyone aware of an existing metric that captures genealogical
>         > or areal spread in this way?
>         >
>         > Best,
>         > Omri
>         >
>         > _______________________________________________
>         > Lingtyp mailing list
>         > Lingtyp at listserv.linguistlist.org
>         > https://listserv.linguistlist.org/cgi-bin/mailman/listinfo/lingtyp
>
>     -- 
>     Martin Haspelmath
>     Max Planck Institute for Evolutionary Anthropology
>     Deutscher Platz 6
>     D-04103 Leipzig
>     https://www.eva.mpg.de/linguistic-and-cultural-evolution/staff/martin-haspelmath/
>
> -- 
> Peter Arkadiev, PhD Habil.
> https://peterarkadiev.github.io/

