[Lingtyp] Reporting cross-linguistic frequencies

Stela MANOVA manova.stela at gmail.com
Sun Nov 23 14:40:56 UTC 2025


Dear all,

A few additional things.

First, please do not worry about not being “math-gifted”. Very few people are (I mean gifted above the average — giftedness has degrees). Mathematical thinking develops with time and exposure.

Second, there is no need to panic or take decisions in a hurry. Somebody mentioned quantum physics: Einstein’s Nobel Prize (1921, awarded 1922) was for the photoelectric effect, a particle-like behaviour of light; a century later, the 2022 Nobel Prize honoured work on quantum entanglement, an explicitly non-classical phenomenon. In other words, physicists needed a century to clarify these foundational questions in a way robust enough for a Nobel. It would be naïve to expect linguists to solve a task of that level (because typology is a task of that level) in a few days or months.

Third, as we know from filmmaking, form (particles) and meaning (waves) can be separated. It is easier to work with forms because they are observable — in the sense that “observable” was used in physics when particles and waves were debated. However, if it is easier to work with forms, this does not mean that we should neglect meaning — not at all — only that we should not link meaning too tightly to form (cf. entanglement, a quantum-mechanical phenomenon where things remain connected regardless of distance).

It seems to me that there are enough Russian linguists who have been working seriously on semantics for years; not exactly on meaning separated from form, but they certainly know enough to adjust and initiate a research paradigm (of course, I am not prescribing anything — this is just an opinion). The Italian linguist Andrea Moro has also made progress on this issue. And yes, I know he is a generative linguist, but for good science the theoretical framework does not matter. It can hinder, but it cannot spoil. (I learned this when I reanalyzed work by his PhD supervisor Guglielmo Cinque in The linear order of elements in prominent linguistic sequences: Deriving Tns–Asp–Mood orders and Greenberg’s Universal 20 with n-grams, lingbuzz/006082.)

If “quantum physics” translates into “teleportation”, then for linguistics the corresponding word should be “telepathy”. See Andrea Moro’s work on silent language (the brain’s processing of language without spoken output). He is only at the beginning, but in my view he has begun the journey. Thus, there is so much to do. I wrote to encourage him, but who am I in the whole story — a Bulgarian who is not even allowed to do linguistics — so please pay attention and try to support him if you can.

Then, future work for psycholinguists. LLMs compose and parse in different ways: for instance, for “metalanguage” they produce meta+language in composition but metal-anguage in parsing. I flagged this problem in Modeling Language Without Language: A ChatGPT Lesson for Language Research (lingbuzz/008998) and plan to explore it further in relation to AI hallucinations. It would be helpful if psycholinguists with proper laboratory access could examine this experimentally. I cannot do this here because my dear colleagues in Vienna have done everything possible and impossible to keep me away from psycholinguistic equipment. So dangerous am I!
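
A toy illustration of how such a mismatch can arise, offered only as a sketch: the vocabulary below is invented for the example, and greedy longest-match segmentation is an assumption chosen for simplicity, not a description of how any particular LLM tokenizes. The word is composed as meta+language, but a left-to-right parser that always takes the longest known piece recovers a different split.

```python
def greedy_segment(word, vocab):
    """Split `word` left to right, always taking the longest vocabulary match.

    Falls back to a single character when no vocabulary entry matches.
    """
    pieces = []
    i = 0
    while i < len(word):
        # Try the longest candidate first, shrinking down to one character.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab or j == i + 1:
                pieces.append(word[i:j])
                i = j
                break
    return pieces


# Hypothetical vocabulary: "metal" and "anguage" are assumed to be known pieces.
TOY_VOCAB = {"meta", "metal", "language", "anguage"}

composed = ["meta", "language"]                       # composition
parsed = greedy_segment("".join(composed), TOY_VOCAB)  # parsing
print(composed, parsed)
```

With this toy vocabulary the parsed pieces differ from the composed ones, because "metal" is a longer match than "meta" at the start of the word.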

As for linguistic research based on form, I think we should concentrate on the fixedness of co-occurrence of forms. It is amusing that when computer scientists use deep nets to detect fixed sequences of tokens (because this is what they actually do), linguists often speak of “competition of units” and explain the results in terms of probability. The two perspectives are not equivalent. Deep nets reveal stable linear co-occurrence patterns, while linguistic explanations tend to dissolve these patterns into abstract probability distributions. It may be time to realign our terminology with the actual behaviour of the systems we study.
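
One way to make "fixedness of co-occurrence" concrete is sketched below. The measure (the share of the most frequent successor of each token) and the toy token sequence are assumptions made for illustration; this is not what deep nets compute internally, only a simple count-based proxy for how fixed a linear sequence is.

```python
from collections import Counter, defaultdict


def successor_fixedness(tokens):
    """For each token, return its most frequent successor and that successor's share.

    A share of 1.0 means the token is always followed by the same token
    (a fully fixed sequence); lower values mean more variable continuations.
    """
    followers = defaultdict(Counter)
    for left, right in zip(tokens, tokens[1:]):
        followers[left][right] += 1

    scores = {}
    for left, counts in followers.items():
        total = sum(counts.values())
        top_right, top_count = counts.most_common(1)[0]
        scores[left] = (top_right, top_count / total)
    return scores


# Invented toy data: "in spite of" recurs with different continuations.
TOY_TOKENS = "in spite of rain in spite of snow in spite of wind".split()
scores = successor_fixedness(TOY_TOKENS)
for token, (successor, share) in scores.items():
    print(f"{token} -> {successor}: {share:.2f}")
```

In this toy sequence, "in" and "spite" have a single fixed continuation, whereas "of" does not; the point of the sketch is only that fixedness can be read directly off linear co-occurrence counts.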

There is also much to do in fieldwork. Things could go in two directions:

(a) deepening existing descriptions, or

(b) providing detailed records (even without grammatical annotation) of understudied languages, putting the recordings online, and using LLMs “for free”, so to speak, for language generation.

I think these two strategies will speed up language documentation considerably. However, it is important to understand that we need rich existing data, not artificially constructed (mini) languages. As I have already warned, analogizing with imaginary entities is dangerous. In mathematics we do have imaginary numbers, but they are not the central set we work with.

I address the issue of data volumes in LLMs in my paper Modeling Language Without Language: A ChatGPT Lesson for Language Research (lingbuzz/008998). For our descriptions and analyses to hold, we need big volumes of data because language is neither an ordered nor a regular system in the mathematical sense.

Please do not hesitate to ask me for explanations if something is unclear.

I hope the above — virtually my first thoughts — helps calm the community a little, and shows that there is a future for linguistics after LLMs. And the best future would be one with LLMs.

All the best,

Stela


> On 21.11.2025, at 16:04, Juergen Bohnemeyer via Lingtyp <lingtyp at listserv.linguistlist.org> wrote:
> 
> Dear Peter — I’m a massive fan of corpus-based typology. More broadly, there is no question in my mind that we should, and must, eventually move from secondary data typology to primary data typology. Nobody seems to deny that secondary data typology is fraught with too many problematic idealizations: in particular, it reduces entire languages to single observations, and it suffers from incomparable decisions on what is treated as a language in different parts of the world. 
> 
> (The second problem is closely related to, but not entirely identical with, the countability problem Ian Joo mentions. The fact that language is a count noun is a powerful illustration of how ordinary language can frame reality in ways that may impede scientific progress if it goes unchecked, as Whorf pointed out. However, actually counting languages is not the issue for regression-based modeling, since regression models don’t operate on counts. But the question whether what is treated as an observation (i.e., a language) is uniform across the sample is of course very much a concern for the validity of sampling-based and regression-based modeling alike.)
> 
> There is a broader answer to your question, though: as a matter of course, when confronting the causal inference problem in typology (i.e., when hunting for the causal forces that shape languages), we must consider every source of evidence that we can get our hands on.  Aside from corpus-based typology, this includes field-based psycholinguistics and the toolkit of evolutionary linguistics, including simulations and miniature artificial language experiments. 
> 
> Let me also suggest a distinction between methods that are primarily geared toward the discovery of typological distributions and the examination of their statistical properties, and methods that can be used to test hypotheses of causal inference (i.e., explanatory hypotheses). Experimental research such as what I just mentioned has its uses primarily for testing explanatory hypotheses. Corpus-based research can have both functions. But if we want to use corpora to discover typological distributions, we’ll need very large parallel corpus databases, as are being developed now.
> 
> Best — Juergen
> 
> 
> 
> Juergen Bohnemeyer (He/Him)
> Professor, Department of Linguistics
> University at Buffalo 
> 
> Office: 642 Baldy Hall, UB North Campus
> Mailing address: 609 Baldy Hall, Buffalo, NY 14260 
> Phone: (716) 645 0127 
> Fax: (716) 645 3825
> Email: jb77 at buffalo.edu
> Web: http://www.acsu.buffalo.edu/~jb77/ 
> 
> Office hours Tu/Th 3:30-4:30pm in 642 Baldy or via Zoom (Meeting ID 585 520 2411; Passcode Hoorheh) 
> 
> There’s A Crack In Everything - That’s How The Light Gets In 
> (Leonard Cohen)  
> -- 
>  
> From: Lingtyp <lingtyp-bounces at listserv.linguistlist.org> on behalf of Peter Arkadiev via Lingtyp <lingtyp at listserv.linguistlist.org>
> Date: Friday, November 21, 2025 at 05:59
> To: Martin Haspelmath <martin_haspelmath at eva.mpg.de>, Linguistic Typology <lingtyp at listserv.linguistlist.org>
> Subject: Re: [Lingtyp] Reporting cross-linguistic frequencies
> 
> Dear Martin, dear all,
>  
> I am starting to wonder whether statistical analysis of a language sample is at all a suitable method for "detecting universal tendencies that are caused by universal/non-historical factors" (Martin's formulation). Given that there is no consensus as to how to overcome genealogical and areal biases, or even whether those biases must be overcome at all and what trying to overcome them actually gets us (apart from getting some of us high-profile publications with ever more complicated mathematical apparatus which others among us struggle to understand and cannot evaluate; not being in any way a "mathematically-gifted person", to borrow Stela's expression, I belong to the latter group), the whole enterprise does not appear to be very productive. What if the more appropriate method, at least where purported functional factors are concerned, is the one employed by John Hawkins, Natalia Levshina and some others, i.e. to combine experimental research on production / processing with a quantitative study of variation in corpora across a small number of sufficiently distinct languages? If we can show that certain well-defined factors are operative in language processing and result in skewed distributions in corpora ultimately translatable into tendencies of diachronic change, and moreover are able to corroborate these results by similarly skewed distributions of variables in reasonably designed cross-linguistic samples, then what else do we need? In any case, as has been repeatedly stated, even if we find that in a certain language sample, however well-designed, a certain variable shows a clearly skewed distribution of, say, 80% vs 20%, nothing follows from this in terms of "universal preferences" unless we are able to independently show that the more frequent value is in some way or other "preferred" in processing / production etc. I am sorry if the above is self-evident or naive.
>  
> Best regards,
>  
> Peter
>  
>  
> ----------------
> To: lingtyp at listserv.linguistlist.org (lingtyp at listserv.linguistlist.org);
> Subject: [Lingtyp] Reporting cross-linguistic frequencies;
> 21.11.2025, 10:19, "Martin Haspelmath via Lingtyp" <lingtyp at listserv.linguistlist.org>:
> Thanks, Jürgen! I like the "wave vs. particle" analogy, because these concrete expressions help us make sense of what seems to be going on (given the experimental results).
> 
> In worldwide comparative linguistics, we also want to make sense of what is going on, but it seems to me that we need analogies not only for interpreting results, but also for understanding what we are aiming for. For me, "removing areal and genealogical/phylogenetic bias" has the aim of detecting universal tendencies that are caused by universal/non-historical factors.
> 
> I would think that on the imagined concrete scenario of a sample of isolated isolates (e.g. 100 languages that have long existed on isolated islands, maybe of the Rapanui type), looking at these 100 isolates should give the same results as looking at 100 sample languages from larger families that have been shaped also by contact.
> 
> Are there reasons to doubt this? If not, then we can take the "isolated isolates" scenario simply as a way of illustrating our goals in concrete terms (somewhat like "wave" and "particle" serve as concrete illustrations). 
> 
> But maybe the imagined scenario (which is not an "assumption"!!) is somehow problematic, because the goals of our enterprise are DIFFERENT. In Bickel's (2007) paper (LiTy 11), which has been widely cited, the idea seems to be that looking for "history-free" tendencies is somehow an obsolete goal.
> 
> Some people have suggested that in identifying universal trends, one MUST take into account genealogies, and isolates are problematic because they are not part of any genealogy. This is because we should not look primarily at languages, but at *transitions* (changes from one type to another). If I understood Verkerk et al. (2025) correctly, they solved the "isolates problem" by using an artificial world tree (where all languages are somehow included; the very beautiful tree is used in the press release <https://www.mpg.de/25723124/1114-evan-enduring-patterns-in-the-world-s-languages-150495-x>). Are Verkerk et al. pursuing a different goal? That is not really clear to me.
> 
> I find the notion of an artificial world tree profoundly strange, much stranger than the hypothetical scenario of 100 isolates on remote islands. But maybe it is needed, because the goal of the enterprise is somehow different (along Bickel's lines)? So I like the imagined "isolated isolates" scenario also because it clarifies what I'm interested in.
> 
> (And isn't Trudgill's idea that isolates are somehow "exotic" very speculative? Shcherbakova et al. 2023 have not provided strong evidence against the idea, but they simply did not find evidence in favour of it.)
> 
> One last point: Yes, all isolates are survivors from some larger family, but why is that relevant? Languages may have existed for half a million years or longer, and we know almost nothing about that deep past. Most of the currently existing families probably had more branches in earlier times, and the protolanguages we reconstruct may or may not have been isolates themselves. We cannot tell, but I don't see why we would need to know.
> 
> Best,
> 
> Martin
> 
>  
> On 21.11.25 07:07, Juergen Bohnemeyer via Lingtyp wrote:
> Dear all — Here’s a quick explanation of why the assumption of an “isolated isolate” is profoundly strange: 
> 
> Leaving aside sign languages, constructed languages, and artificial languages, nobody seems to entertain the possibility that languages have emerged spontaneously out of something that we wouldn’t consider a language itself over the last few thousand years. In other words, the languages we consider isolates are without exception lone survivors; but they did descend from ancestors which are often lost and unknown, and these ancestors biased the offshoot's properties by dint of inheritance/transmission.
> 
> The reason isolates are interesting from a sampling perspective is that they may represent entire genera or families without forcing us to pick a member. But being an isolate does not mean being free of phylogenetic bias. On the contrary: isolates of unknown descent are actually particularly problematic in the sense that they are shaped by biases that we have no way of identifying directly since the biasing ancestors have been lost to time.
> 
> As to contact. Languages that are not in contact with other languages over long stretches of time are extremely rare and unusual. In fact, as I’m sure everyone here is aware, such languages have been plausibly argued to tend to evolve exotic properties as a result of their isolation (Lupyan & Dale 2010; Trudgill 2011), although this is controversial (Shcherbakova et al. 2023). In any case, I would certainly not want to make such languages the basis for causal inference in typology.
> 
> But it gets a lot worse. The “isolated isolate” interpretation doesn’t just require us to think of a language that isn’t currently in contact with any other language. We would have to assume a language that has never come into contact with any other language at any point in its history (at least not long/intensively enough to change as a result of it). I’m seriously uncertain whether such a language has ever existed on this planet.
> 
> Here’s an analogy from quantum mechanics: Schrödinger’s and Heisenberg’s equations are mathematical models that describe the experimentally observed behavior of elementary particles under various conditions. The particle and the wave interpretation are interpretations that we use to make sense of these mathematical models. We find these interpretations useful because most of us don’t think in mathematical equations (not even theoretical physicists, it would seem). But if we push these interpretations beyond a certain point, they break down. To begin with, we can’t think of something simultaneously as a wave and as a particle.
> 
> In the same way, we can mathematically describe the influence phylogeny and areality exert on the probability of a particular language having certain properties. The “isolated isolate” interpretation is just that - an interpretation of the statistical models; but, as I tried to show above, it runs into absurdities rather more quickly than the particle and wave interpretations in quantum mechanics. 
> 
> Best — Juergen
> 
> G. Lupyan, R. Dale, Language structure is partly determined by social structure. PLOS ONE 5, e8559 (2010).
> 
> O. Shcherbakova, S. M. Michaelis, H. J. Haynie, et al., Societies of strangers do not speak less complex languages. Science Advances 9, eadf7704 (2023).
> 
> P. Trudgill, Sociolinguistic Typology: Social Determinants of Linguistic Complexity (Oxford Univ. Press, 2011).
> 
> -- 
>  
> From: Lingtyp <lingtyp-bounces at listserv.linguistlist.org> on behalf of Matías Guzmán Naranjo via Lingtyp <lingtyp at listserv.linguistlist.org>
> Date: Thursday, November 20, 2025 at 04:01
> To: lingtyp at listserv.linguistlist.org <lingtyp at listserv.linguistlist.org>
> Subject: Re: [Lingtyp] Reporting cross-linguistic frequencies
> 
> I'll jump in with some thoughts.
> 
> 
> - Dryer's method and ours aim at doing basically the same thing:
> quantifying what's "left" after removing genetic and areal bias.
> 
> - Whether you should call them proportions or adjusted frequencies...
> I'm not sure that it matters that much? As long as you understand how
> they were calculated...
> 
> - How you want to interpret this "what's left" is debatable, I guess,
> but I don't think I agree with Jürgen. As far as I can tell it should be
> compatible with something along the lines of an "isolated isolate" as
> described by Martin. You can also see them as 'universal' preferences
> (more or less the same thing?).
> 
> - "the probability of a random language having a certain property
> depends on (or is influenced by, or varies with, etc.) it being related
> to certain other languages, or being spoken (or signed) in a particular
> area" -> In our approach we assume that the probability of a language L
> having some feature value F depends on three things: 1) its relatedness
> to other languages, 2) its contact with other languages, 3) some universal
> preference for F. Kind of the point of what we do is that we try to
> estimate each of these factors. [We can add more factors and more
> structure, but that's the most basic model.]
> 
> - You can quantify the contribution of the phylogenetic component and
> the areal component(s) with our techniques, but this is a bit tricky
> because there is unavoidable overlap in the information each one
> contains. These measures also have a different meaning than the adjusted
> frequencies and can't be used as a replacement for them; you can only
> use them in addition.
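
The three-component description above can be sketched as a simple additive model on the logit scale. This is only an illustration under stated assumptions: the additive logistic form, the function names, and all numeric values are invented here, and the models actually used by the authors in this thread may be considerably more structured.

```python
import math


def inv_logit(x):
    """Map a value on the logit (log-odds) scale to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def feature_probability(universal, family_effect, area_effect):
    """Probability that a language has feature value F.

    Assumed form: logit P(F) = universal preference
                               + effect of relatedness (family)
                               + effect of contact (area).
    """
    return inv_logit(universal + family_effect + area_effect)


# Hypothetical values on the logit scale, chosen only for illustration.
UNIVERSAL = 0.8

# A language whose family favours F and whose area slightly disfavours it.
p_biased = feature_probability(UNIVERSAL, family_effect=1.5, area_effect=-0.4)

# The "isolated isolate" reading: family and area effects set to zero,
# leaving only the universal preference (the "adjusted" probability).
p_adjusted = feature_probability(UNIVERSAL, family_effect=0.0, area_effect=0.0)

print(f"with family and area effects: {p_biased:.3f}")
print(f"universal preference only:    {p_adjusted:.3f}")
```

On this reading, the adjusted frequency corresponds to the prediction with the family and area terms removed, which is one way to see why it has a different meaning from a measure of how much each component contributes.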
> 
> 
> Matías
> 
> 
> 
> El 20/11/25 a las 9:36, Omri Amiraz via Lingtyp escribió:
> > Dear all,
> > I agree with Ian that, in addition to genealogical and areal biases,
> > the very question of what counts as a language versus a dialect is
> > partly subjective. This makes actual frequencies even more
> > problematic, since we would obtain different results depending on
> > whether we treat Wu Chinese as one language or as thirty separate
> > languages, as Ian pointed out.
> > Juergen wrote: "We can empirically assess the extent to which the
> > probability of a random language having a certain property depends on
> > (or is influenced by, or varies with, etc.) it being related to
> > certain other languages, or being spoken (or signed) in a particular
> > area."
> >
> > I wonder whether it might be useful to have a measure of the
> > genealogical and areal spread of a feature, essentially quantifying
> > how broadly it is distributed across families and regions in the
> > present-day world. Such a measure might be more straightforward to
> > interpret than an adjusted frequency/probability, since it is not
> > clear whether the described population is a hypothetical set of
> > isolated isolates or something else.
> >
> > Is anyone aware of an existing metric that captures genealogical or
> > areal spread in this way?
> >
> > Best,
> > Omri
> >
> > _______________________________________________
> > Lingtyp mailing list
> > Lingtyp at listserv.linguistlist.org
> > https://listserv.linguistlist.org/cgi-bin/mailman/listinfo/lingtyp
> -- 
> Martin Haspelmath
> Max Planck Institute for Evolutionary Anthropology
> Deutscher Platz 6
> D-04103 Leipzig
> https://www.eva.mpg.de/linguistic-and-cultural-evolution/staff/martin-haspelmath/
> 
>  
>  
> -- 
> Peter Arkadiev, PhD Habil.
> https://peterarkadiev.github.io/
>  
