[Lingtyp] Reporting cross-linguistic frequencies

Juergen Bohnemeyer jb77 at buffalo.edu
Mon Nov 24 20:11:42 UTC 2025


Dear Bill, reference grammars and dictionaries aim for generalizations over the linguistic behavior of speech communities. The generalizations are based on primary data, but are themselves secondary data. And when a typologist uses an existing description or documentation to categorize the language as exhibiting one type as opposed to others, they treat the description as secondary data.

Take for example Namboodiripad (2017) on word order in Malayalam. WALS treats Malayalam as SOV. Grambank treats word order in Malayalam as verb-final and non-fixed. Whatever information the coders had at their disposal when making these calls is thereby reduced to secondary data. WALS and Grambank reduce the amount of information so much that the behavior of individual speakers is no longer visible. That’s secondary data.

Namboodiripad showed in her dissertation that word order flexibility is itself variable, decreasing in younger speakers possibly as a result of contact. Those are generalizations based on primary data. The generalizations themselves are again secondary data, but data that preserves information about variation rather than reducing the language to a single datapoint. Levshina et al. (2023) demonstrate what a primary-data-based typology of word order might look like.

Best — Juergen

Levshina, N., Namboodiripad, S., Allassonnière-Tang, M., Kramer, M., Talamo, L., Verkerk, A., Wilmoth, S., Garrido Rodriguez, G., Gupton, T., Kidd, E., Liu, Z., Naccarato, C., Nordlinger, R., Panova, N., Stoynova, N. (2023). Why we need a gradient approach to word order. Linguistics 61(4): 825-883.

Namboodiripad, S. (2017). An Experimental Approach to Variation and Variability in Constituent Order. Doctoral Dissertation, University of California, San Diego.



Juergen Bohnemeyer (He/Him)
Professor, Department of Linguistics
University at Buffalo

Office: 642 Baldy Hall, UB North Campus
Mailing address: 609 Baldy Hall, Buffalo, NY 14260
Phone: (716) 645 0127
Fax: (716) 645 3825
Email: jb77 at buffalo.edu<mailto:jb77 at buffalo.edu>
Web: http://www.acsu.buffalo.edu/~jb77/

Office hours Tu/Th 3:30-4:30pm in 642 Baldy or via Zoom (Meeting ID 585 520 2411; Passcode Hoorheh)

There’s A Crack In Everything - That’s How The Light Gets In
(Leonard Cohen)

--



From: William Croft <wacroft at icloud.com>
Date: Monday, November 24, 2025 at 14:23
To: Juergen Bohnemeyer <jb77 at buffalo.edu>, Linguistic Typology <lingtyp at listserv.linguistlist.org>
Subject: Re: [Lingtyp] Reporting cross-linguistic frequencies

Dear Juergen,

   Thanks very much for your clarifications. I think that a major part of my misunderstanding your emails is down to not understanding what you are describing as “secondary data”. I assumed you meant reference grammars etc. produced by documentary/descriptive linguists, that typologists use for constructing cross-linguistic generalizations and explanations for those generalizations. Analogously, with respect to semantic typology, where you said “secondary data is not available”, I misunderstood you to be using “secondary data” to refer to dictionaries produced by documentary/descriptive linguists used by some typologists in semantic typology. What are you referring to by “secondary data”?

   Regarding my last comment, I was reacting to this statement in your 25 Nov email:

Nobody seems to deny that secondary data typology is fraught with too many problematic idealizations: in particular, it reduces entire languages to single observations, and it suffers from incomparable decisions on what is treated as a language in different parts of the world.

   This passage made me think that you are suggesting getting rid of “secondary data” in typology because it has “too many problematic idealizations”. Again, though, I think I am misunderstanding what you mean by “secondary data”.

   See also your comment in your latest email:

Reducing languages to single observations was until now a necessary idealization, as happens in the history of science over and over again.

   I am still not sure what sort of typology you are referring to. The sort of typology I am most familiar with (and most respect) uses information from reference grammars and other descriptive materials to both construct cross-linguistic generalizations and find clues for possible explanations of the generalizations from information in the reference grammar on variation, subtle differences in form and use, etc. It doesn’t reduce languages to single observations. But maybe this isn’t what you are referring to as “secondary data typology”?

Apologies and best wishes,
Bill

On Nov 24, 2025, at 11:38 AM, Juergen Bohnemeyer <jb77 at buffalo.edu> wrote:

Dear all — There’s quite a bit of distortion here of what I said, unintentionally I’m sure. Still, I feel I need to clarify:

First off, I didn’t say “all semantic typology research must use primary data”. I said most does.

Secondly, Bill says “Documentary/descriptive linguists do not just ‘abstract away from individual speakers and attribute certain properties to entire linguistic varieties and speech communities’. Their descriptions are based on ‘primary data’, and frequently describe variation, contexts of use, interactional phenomena, social attitudes, socially governed differences in language behavior, etc. These are valuable generalizations about a language as a community entity.” However, I did not in any way, shape, or form suggest that description is based on, or even aims to produce, secondary data. I didn’t in fact comment on practices of language description/documentation at all. What I said is that secondary data typology uses results of language descriptions as secondary data.

And lastly, I have no idea where Bill is taking this from:

"There seems to be a purist view here that some data is perfect ("perfectly natural", "perfectly controlled", or whatever), and other data is so flawed as to be useless (see Juergen’s 25 Nov email below)”

I didn’t use the words “perfect” and “flawed” at all. What I was commenting on is that secondary data typology, by virtue of reducing entire languages to single observations, ignores vast amounts of information about them. In the past, this was inevitable because there was no reasonable alternative. This is now slowly changing, largely as a result of technological advancements. So if we can do better, we will, unless you expect science to stagnate or backslide. At the same time, I’m sure secondary-data typology will remain an important part of the toolkit, particularly as a means of aggregating primary data.

Reducing languages to single observations was until now a necessary idealization, as happens in the history of science over and over again. Consider for example grammaticality judgments: until recently, syntacticians were basing their generalizations on categorizing sentences dichotomously as grammatical or ungrammatical. Now the field is slowly changing to open itself up to more nuanced evidence from psycholinguistics and corpus linguistics. I see the role of primary data in typology as a rather close analogy to that.

Best — Juergen





From: William Croft <wacroft at icloud.com>
Date: Monday, November 24, 2025 at 11:52
To: Juergen Bohnemeyer <jb77 at buffalo.edu>, Linguistic Typology <lingtyp at listserv.linguistlist.org>
Subject: Re: [Lingtyp] Reporting cross-linguistic frequencies

Some comments on Juergen's email, starting from the end.

It is not the case that all funding is directed towards "secondary data". There are quite a few sources for funding language documentation, not to mention sources for funding experimental psycholinguistics. My impression is that it is very difficult to obtain funding for a typological project based solely on "secondary data", that is, data collected solely from descriptive materials.

It is not the case that all semantic typological research must use "primary data" "out of sheer necessity". See for example the publications of Brown and Witkowski (e.g. Brown 1984), the Database of Semantic Shifts (https://datsemshift.ru/), or a paper from a project I was involved with (Youn et al. 2016). Conversely, it is not the case that phonetic typology is entirely based on "secondary data"; see for example the experimental research described in Ladefoged and Maddieson (1996).

Himmelmann (1998) has a more nuanced description of "primary data" and its relation to "secondary data". For example: 'There are generally three components to each document (piece of data), viz. the “raw” data in various forms of representation (transcription, tape, and/or video), a translation (word-by-word/interlinear and free), and a commentary providing additional information as to recording circumstance, linguistic and cultural peculiarities associated with the data segment, comments by native speakers cooperating in the transcription and translation of the segment, problems encountered in transcribing and translating, further data elicited in connection with the segment, etc. In short, everything that happened during recording, transcribing, and translating the data (and eliciting, in the case of elicited data)' (pp. 169-170). Note that 'document (piece of data)' includes transcription, IMT and translation, the elements of a text corpus. The main problem with traditional text corpora is that they are incomplete: they often lack audio or video, include only a minimal version of the third component (the metadata captures only a fraction of it), and cover a restricted selection of discourse types (Himmelmann 1998:166ff). But they are not worthless. For many languages, they are all that we have of any form of discourse.

Finally, a language as a community entity is more than just a set of individual speakers' productions. There is a social dimension to language and language use (not to mention also a cognitive dimension of speaker intentions in social interactions involving language). Documentary/descriptive linguists do not just 'abstract away from individual speakers and attribute certain properties to entire linguistic varieties and speech communities'. Their descriptions are based on "primary data", and frequently describe variation, contexts of use, interactional phenomena, social attitudes, socially governed differences in language behavior, etc. These are valuable generalizations about a language as a community entity.

There seems to be a purist view here that some data is perfect ("perfectly natural", "perfectly controlled", or whatever), and other data is so flawed as to be useless (see Juergen’s 25 Nov email below). No data is perfect, and all data is useful, even if it must be taken with a grain of salt.

Bill

Brown, Cecil H. 1984. Language and Living Things. New Brunswick, NJ: Rutgers University Press.

Ladefoged, Peter & Ian Maddieson. 1996. The sounds of the world’s languages. Oxford: Basil Blackwell.

Youn, Hyejin, Logan Sutton, Eric Smith, Cristopher Moore, Jon F. Wilkins, Ian Maddieson, William Croft and Tanmoy Bhattacharya. 2016. On the universal structure of human lexical semantics. Proceedings of the National Academy of Sciences 113(7).1766-71.

On Nov 24, 2025, at 8:04 AM, Juergen Bohnemeyer via Lingtyp <lingtyp at listserv.linguistlist.org> wrote:

Sorry, just to clarify further: by “generalizations over languages”, I didn’t mean typological generalizations; I meant descriptive statements about individual languages. Those are generalizations in the sense that they abstract away from individual speakers and attribute certain properties to entire linguistic varieties or speech communities. That’s the nature of secondary data in my view. — Best — Juergen




From: Lingtyp <lingtyp-bounces at listserv.linguistlist.org> on behalf of Juergen Bohnemeyer via Lingtyp <lingtyp at listserv.linguistlist.org>
Date: Monday, November 24, 2025 at 09:47
To: Mira Ariel <mariel at tauex.tau.ac.il>, Martin Haspelmath <martin_haspelmath at eva.mpg.de>, Peter Arkadiev <peterarkadiev at yandex.ru>, Linguistic Typology <lingtyp at listserv.linguistlist.org>
Subject: Re: [Lingtyp] Reporting cross-linguistic frequencies

Dear all, I’m treating as primary data anything that consists of the speech, or judgments (although those to me have a less “vivid” quality as data), of individual speakers (and analogously for sign language), as opposed to generalizations over languages: that’s what I mean by secondary data. I’m well aware that corpus data has an in-between status. Perhaps rather than saying that it is primary data, it would be more appropriate to say that it can be used, to some extent, like primary data.

Primary data can be the result of spontaneous observation, can consist of recordings of what Himmelmann (1998) calls ‘staged’ discourses, and can be elicited or collected experimentally. I see experimentation and elicitation as cluster concepts that form a multidimensional continuum (as discussed in my upcoming book on Semantic research: From data to analysis, due out with CUP in January).

Today, almost all of morphosyntactic typology and the bulk of phonetic typology are based on secondary data. In contrast, semantic typology (my primary focus) mostly utilizes primary data, out of sheer necessity, since secondary data is not available.

And to respond to Martin: I really didn’t mean to suggest that we drop secondary data typology right this minute ;-) (I’m actually myself up to my ears in Grambank data these days.) What I’m envisioning is a gradual shift in emphasis over the next couple of decades, especially when it comes to megaprojects (by typological standards) such as Grambank. Creating the resources needed to get us into striking distance for primary data typology on grammar will require a vast effort, so at some point, typologists and funders will have to make decisions on which basket they want to place those big eggs (sorry, mixing metaphors again) in, continuing to pour everything into the secondary data basket or gradually shifting emphasis toward funding more primary data projects.

Best — Juergen

Himmelmann, N. P. (1998). Documentary and descriptive linguistics. Linguistics 36:161-195.




From: Mira Ariel <mariel at tauex.tau.ac.il>
Date: Monday, November 24, 2025 at 09:14
To: Martin Haspelmath <martin_haspelmath at eva.mpg.de>, Juergen Bohnemeyer <jb77 at buffalo.edu>, Peter Arkadiev <peterarkadiev at yandex.ru>, Linguistic Typology <lingtyp at listserv.linguistlist.org>
Subject: RE: [Lingtyp] Reporting cross-linguistic frequencies

Hi,

I’m not a typologist, but in semantics/pragmatics research a similar dilemma arises: Corpus data or experimental data? My experience has been that although both have flaws, both can advance our understanding of language. We should just give up on the idea that we could find the one perfect methodology. That said, there’s plenty of room to criticize what one thinks is a flawed methodology, of course.

Best,
Mira

From: Lingtyp <lingtyp-bounces at listserv.linguistlist.org> On Behalf Of Martin Haspelmath via Lingtyp
Sent: Sunday, November 23, 2025 11:56 PM
To: Juergen Bohnemeyer <jb77 at buffalo.edu>; Peter Arkadiev <peterarkadiev at yandex.ru>; Linguistic Typology <lingtyp at listserv.linguistlist.org>
Subject: Re: [Lingtyp] Reporting cross-linguistic frequencies


I agree with Peter that the corpus-based methods employed by Hawkins, Wälchli, Cysouw, Levshina and others have been very important, and also with Jürgen that "when confronting the causal inference problem in typology, we must consider every source of evidence that we can get our hands on."

But I don't agree with Peter that "the whole enterprise [of overcoming genealogical and areal biases] does not appear to be very productive", and I don't agree with Jürgen that we "must eventually move from secondary data typology to primary data typology".

I think that the enterprise of controlling for family and contact effects is absolutely necessary, because otherwise we cannot distinguish outcomes of universal/non-historical factors from outcomes of historical events. Peter recognizes this implicitly when he says that we should "combine experimental research ... with a quantitative study of variation in corpora across a small number of sufficiently distinct languages". That's precisely the point: Which languages are "sufficiently distinct"? And hasn't the search for empirical universals been *highly productive* over the last few decades? The recent paper by Verkerk et al. (2025) has found good evidence for most of the empirical universals that had been seriously discussed earlier, so the Greenbergian universals seem to be very robust findings compared to many other prestigious claims in linguistics.

And I think that there is no reason to abandon secondary-data typology just because we can also (increasingly) do primary-data typology. Typological comparison can be done at multiple scales and multiple levels of granularity, and it is not clear that we can dispense with any of these levels. For example, we want to do typology of phonological segments (along the lines of the Phoible.org database), or typology of word meanings (lexification typology, cf. https://clics.clld.org/), and for these, it seems that secondary data will not be easily replaced.

Best,

Martin


On 21.11.25 16:04, Juergen Bohnemeyer wrote:
Dear Peter — I’m a massive fan of corpus-based typology. More broadly, there is no question in my mind that we should, and must, eventually move from secondary data typology to primary data typology. Nobody seems to deny that secondary data typology is fraught with too many problematic idealizations: in particular, it reduces entire languages to single observations, and it suffers from incomparable decisions on what is treated as a language in different parts of the world.

(The second problem is closely related to, but not entirely identical with, the countability problem Ian Joo mentions. The fact that language is a count noun is a powerful illustration of how ordinary language can frame reality in ways that may impede scientific progress if it goes unchecked, as Whorf pointed out. However, actually counting languages is not the issue for regression-based modeling, since regression models don’t operate on counts. But the question whether what is treated as an observation (i.e., a language) is uniform across the sample is of course very much a concern for the validity of sampling-based and regression-based modeling alike.)

There is a broader answer to your question, though: as a matter of course, when confronting the causal inference problem in typology (i.e., when hunting for the causal forces that shape languages), we must consider every source of evidence that we can get our hands on.  Aside from corpus-based typology, this includes field-based psycholinguistics and the toolkit of evolutionary linguistics, including simulations and miniature artificial language experiments.

Let me also suggest a distinction between methods that are primarily geared toward the discovery of typological distributions and the examination of their statistical properties and methods that can be used to test hypotheses of causal inference (i.e., explanatory hypotheses). Experimental research such as what I just mentioned has its uses primarily for testing explanatory hypotheses. Corpus-based research can have both functions. But if we want to use corpora to discover typological distributions, we’ll need very large parallel corpus databases, as are being developed now.

Best — Juergen




