[Lingtyp] Reporting cross-linguistic frequencies
Matías Guzmán Naranjo
mguzmann89 at gmail.com
Fri Nov 21 12:37:33 UTC 2025
Dear Peter,
I'm a bit biased because most of my work centers around developing new
statistical methods, but I am very sympathetic to the skepticism that
you raise, especially given the lack of consensus across researchers, as
well as other technical and non-technical issues. Experimental
techniques can get around several drawbacks of statistical modeling of
large-scale typological samples, and they also provide different types
of insight, but they have their own set of drawbacks and problems.
Replacing statistical modeling of typological samples with experiments
on a small sample of languages does not get rid of uncertainty; it just
changes what we're uncertain about. If the aim is to find 'universal
tendencies', I am not sure that the answer should be to completely avoid
a set of methods. Rather, we should hope for agreeing results across
different approaches while trying to understand the weaknesses in each.
> the whole enterprise does not appear to be very productive
I don't disagree with this sentiment, but I wonder whether this is an
inherent property of the enterprise or a result of us still trying to
figure out what works, what doesn't, and how we should interpret our
results. To me, it feels a bit early to fully give up on it.
Matías
On 21/11/25 at 11:59, Peter Arkadiev via Lingtyp wrote:
> Dear Martin, dear all,
> I am starting to wonder whether statistical analysis of a language
> sample is at all a suitable method for "detecting universal tendencies
> that are caused by universal/non-historical factors" (Martin's
> formulation). Given that there is no consensus as to how to overcome
> genealogical and areal biases and even whether those biases must be
> overcome at all and what trying to overcome them actually gets us
> (apart from getting some of us high-profile publications with ever
> more complicated mathematical apparatus which others among us struggle
> to understand and cannot evaluate; not being in any way a
> "mathematically-gifted person", to borrow Stela's expression, I belong
> to the latter group), the whole enterprise does not appear to be very
> productive. What if the more appropriate method, at least where purported
> functional factors are concerned, is the one employed by John
> Hawkins, Natalia Levshina and some others, i.e. to combine
> experimental research on production / processing with a quantitative
> study of variation in corpora across a small number of sufficiently
> distinct languages? If we can show that certain well-defined factors
> are operative in language processing and result in skewed
> distributions in corpora ultimately translatable into tendencies of
> diachronic change, and moreover are able to corroborate these results
> by similarly skewed distributions of variables in reasonably designed
> cross-linguistic samples, then what else do we need? In any case, as
> has been stated many times, even if we find that in a
> certain language sample, however well-designed, a certain variable
> shows a clearly skewed distribution of, say, 80% vs. 20%, nothing
> follows from this in terms of "universal preferences" unless we are
> able to independently show that the more frequent value is in some or
> other way "preferred" in processing / production etc. I am sorry if
> the above is self-evident or naive.
> Best regards,
> Peter
> ----------------
> To: lingtyp at listserv.linguistlist.org;
> Subject: [Lingtyp] Reporting cross-linguistic frequencies;
> 21.11.2025, 10:19, "Martin Haspelmath via Lingtyp"
> <lingtyp at listserv.linguistlist.org>:
>
> Thanks, Jürgen! I like the "wave vs. particle" analogy, because
> these concrete expressions help us make sense of what seems to be
> going on (given the experimental results).
>
> In worldwide comparative linguistics, we also want to make sense
> of what is going on, but it seems to me that we need analogies not
> only for interpreting results, but also for understanding what we
> are aiming for. For me, "removing areal and
> genealogical/phylogenetic bias" has the aim of detecting universal
> tendencies that are caused by universal/non-historical factors.
>
> I would think that in the imagined concrete scenario of a sample
> of isolated isolates (e.g. 100 languages that have long existed on
> isolated islands, maybe of the Rapanui type), looking at these 100
> isolates should give the same results as looking at 100 sample
> languages from larger families that have been shaped also by contact.
>
> Are there reasons to doubt this? If not, then we can take the
> "isolated isolates" scenario simply as a way of illustrating our
> goals in concrete terms (somewhat like "wave" and "particle" serve
> as concrete illustrations).
>
> But maybe the imagined scenario (which is not an "assumption"!!)
> is somehow problematic, because the goals of our enterprise are
> DIFFERENT. In Bickel's (2007) paper (LiTy 11), which has been
> widely cited, the idea seems to be that looking for "history-free"
> tendencies is somehow an obsolete goal.
>
> Some people have suggested that in identifying universal trends,
> one MUST take into account genealogies, and isolates are
> problematic because they are not part of any genealogy. This is
> because we should not look primarily at languages, but at
> *transitions* (changes from one type to another). If I understood
> Verkerk et al. (2025) correctly, they solved the "isolates
> problem" by using an artificial world tree (where all languages
> are somehow included; the very beautiful tree is used in the press
> release
> <https://www.mpg.de/25723124/1114-evan-enduring-patterns-in-the-world-s-languages-150495-x>).
> Are Verkerk et al. pursuing a different goal? That is not really
> clear to me.
>
> I find the notion of an artificial world tree profoundly strange,
> much stranger than the hypothetical scenario of 100 isolates on
> remote islands. But maybe it is needed, because the goal of the
> enterprise is somehow different (along Bickel's lines)? So I like
> the imagined "isolated isolates" scenario also because it
> clarifies what I'm interested in.
>
> (And isn't Trudgill's idea that isolates are somehow "exotic" very
> speculative? Shcherbakova et al. 2023 have not provided strong
> evidence against the idea, but they simply did not find evidence
> in favour of it.)
>
> One last point: Yes, all isolates are survivors from some larger
> family, but why is that relevant? Languages may have existed for
> half a million years or longer, and we know almost nothing about
> that deep past. Most of the currently existing families probably
> had more branches in earlier times, and the protolanguages we
> reconstruct may or may not have been isolates themselves. We
> cannot tell, but I don't see why we would need to know.
>
> Best,
>
> Martin
>
> On 21.11.25 07:07, Juergen Bohnemeyer via Lingtyp wrote:
>
> Dear all — Here’s a quick explanation of why the assumption of
> an “isolated isolate” is profoundly strange:
>
> Leaving aside sign languages, constructed languages, and
> artificial languages, nobody seems to entertain the
> possibility that languages have emerged spontaneously out of
> something that we wouldn’t consider a language itself over the
> last few thousand years. In other words, the languages we
> consider isolates are without exception lone survivors; but
> they did descend from ancestors which are often lost and
> unknown, and these ancestors biased the offshoot's properties
> by dint of inheritance/transmission.
>
> The reason isolates are interesting from a sampling
> perspective is that they may represent entire genera or
> families without forcing us to pick a member. But being an
> isolate does not mean being free of phylogenetic bias. On the
> contrary: isolates of unknown descent are actually
> particularly problematic in the sense that they are shaped by
> biases that we have no way of identifying directly since the
> biasing ancestors have been lost to time.
>
> As to contact. Languages that are not in contact with other
> languages over long stretches of time are extremely rare and
> unusual. In fact, as I’m sure everyone here is aware, such
> languages have been plausibly argued to tend to evolve exotic
> properties as a result of their isolation (Lupyan & Dale 2010;
> Trudgill 2011), although this is controversial (Shcherbakova
> et al. 2023). In any case, I would certainly not want to make
> such languages the basis for causal inference in typology.
>
> But it gets a lot worse. The “isolated isolate” interpretation
> doesn’t just require us to think of a language that isn’t
> currently in contact with any other language. We would have to
> assume a language that has *never* come into contact with any
> other language at any point in its history (at least not
> long/intensively enough to change as a result of it). I’m
> seriously uncertain whether such a language has ever existed
> on this planet.
>
> Here’s an analogy from quantum mechanics: Schrödinger’s and
> Heisenberg’s equations are mathematical models that describe
> the experimentally observed behavior of elementary particles
> under various conditions. The particle and the wave
> interpretation are interpretations that we use to make sense
> of these mathematical models. We find these interpretations useful
> because most of us don’t think in mathematical equations (not
> even theoretical physicists, it would seem). But if we push
> these interpretations beyond a certain point, they break down.
> To begin with, we can’t think of something simultaneously as a
> wave and as a particle.
>
> In the same way, we can mathematically describe the influence
> phylogeny and areality exert on the probability of a
> particular language having certain properties. The “isolated
> isolate” interpretation is just that - an interpretation of
> the statistical models; but, as I tried to show above, it runs
> into absurdities rather more quickly than the particle and
> wave interpretations in quantum mechanics.
>
> Best — Juergen
>
> G. Lupyan, R. Dale, Language structure is partly determined by
> social structure. /PLOS ONE/ 5, e8559 (2010).
>
> O. Shcherbakova, S. M. Michaelis, H. J. Haynie, et al.,
> Societies of strangers do not speak less complex languages.
> /Science Advances/ 9, eadf7704 (2023).
>
> P. Trudgill, /Sociolinguistic Typology: Social Determinants of
> Linguistic Complexity/ (Oxford Univ. Press, 2011).
>
> Juergen Bohnemeyer (He/Him)
> Professor, Department of Linguistics
> University at Buffalo
>
> Office: 642 Baldy Hall, UB North Campus
> Mailing address: 609 Baldy Hall, Buffalo, NY 14260
> Phone: (716) 645 0127
> Fax: (716) 645 3825
> Email: jb77 at buffalo.edu
> Web: http://www.acsu.buffalo.edu/~jb77/
>
> Office hours Tu/Th 3:30-4:30pm in 642 Baldy or via Zoom
> (Meeting ID 585 520 2411; Passcode Hoorheh)
>
> There’s A Crack In Everything - That’s How The Light Gets In
> (Leonard Cohen)
>
> --
>
> *From: *Lingtyp <lingtyp-bounces at listserv.linguistlist.org> on behalf
> of Matías Guzmán Naranjo via Lingtyp
> <lingtyp at listserv.linguistlist.org>
> *Date: *Thursday, November 20, 2025 at 04:01
> *To: *lingtyp at listserv.linguistlist.org
> *Subject: *Re: [Lingtyp] Reporting cross-linguistic frequencies
>
> I'll jump in with some thoughts.
>
>
> - Dryer's method and ours aim at doing basically the same thing:
> quantifying what's "left" after removing genetic and areal bias.
>
> - Whether you should call them proportions or adjusted frequencies...
> I'm not sure that it matters that much? As long as you understand how
> they were calculated...
>
> - How you want to interpret this "what's left" is debatable, I guess,
> but I don't think I agree with Jürgen. As far as I can tell it should
> be compatible with something along the lines of an "isolated isolate"
> as described by Martin. You can also see them as 'universal'
> preferences (more or less the same thing?).
>
> - "the probability of a random language having a certain property
> depends on (or is influenced by, or varies with, etc.) it being
> related to certain other languages, or being spoken (or signed) in a
> particular area" -> In our approach we assume that the probability of
> a language L having some feature value F depends on three things:
> 1) its relatedness to other languages, 2) its contact with other
> languages, 3) some universal preference for F. Kind of the point of
> what we do is that we try to estimate each of these factors. [We can
> add more factors and more structure, but that's the most basic model.]
>
> - You can quantify the contribution of the phylogenetic component and
> the areal component(s) with our techniques, but this is a bit tricky
> because there is unavoidable overlap in the information each one
> contains. These measures also have a different meaning than the
> adjusted frequencies and can't be used as a replacement for them;
> they can be used in addition.
>
>
> Matías
>
>
>
> On 20/11/25 at 9:36, Omri Amiraz via Lingtyp wrote:
> > Dear all,
> > I agree with Ian that, in addition to genealogical and areal biases,
> > the very question of what counts as a language versus a dialect is
> > partly subjective. This makes actual frequencies even more
> > problematic, since we would obtain different results depending on
> > whether we treat Wu Chinese as one language or as thirty separate
> > languages, as Ian pointed out.
> > Juergen wrote: "We can empirically assess the extent to which the
> > probability of a random language having a certain property depends
> > on (or is influenced by, or varies with, etc.) it being related to
> > certain other languages, or being spoken (or signed) in a particular
> > area."
> >
> > I wonder whether it might be useful to have a measure of the
> > genealogical and areal spread of a feature, essentially quantifying
> > how broadly it is distributed across families and regions in the
> > present-day world. Such a measure might be more straightforward to
> > interpret than an adjusted frequency/probability, since it is not
> > clear whether the described population is a hypothetical set of
> > isolated isolates or something else.
> >
> > Is anyone aware of an existing metric that captures genealogical or
> > areal spread in this way?
> >
> > Best,
> > Omri
> >
> > _______________________________________________
> > Lingtyp mailing list
> > Lingtyp at listserv.linguistlist.org
> > https://listserv.linguistlist.org/cgi-bin/mailman/listinfo/lingtyp
>
> --
> Martin Haspelmath
> Max Planck Institute for Evolutionary Anthropology
> Deutscher Platz 6
> D-04103 Leipzig
> https://www.eva.mpg.de/linguistic-and-cultural-evolution/staff/martin-haspelmath/
>
> --
> Peter Arkadiev, PhD Habil.
> https://peterarkadiev.github.io/
>