[Lingtyp] "AI" and linguistics problem sets

Tue Nov 11 04:08:17 UTC 2025

Dear Juergen/All,

First, thanks very much to you and others who have responded both on-list and off so far - it has been fascinating to find that there are such different perspectives on this issue, ranging from “my students simply don’t use AIs to do their work, so I see no problem with continuing to use problem sets in assessments” to “I know that my students will use AIs on problem sets no matter what I tell them, so I only ever give locked-down exams”. And of course, quite a few people hoping - like me - for some sort of middle ground.

I want to respond to Juergen’s question now because I’m afraid that the discussion is getting a bit derailed by thoughts about the nature/features of LLMs and what they are or aren’t doing with text-based language data, which - while interesting - is really not what I wanted to focus on (though I would love to follow a discussion along those lines in another thread).

To try to refocus:

  1.
I’m asking specifically about long-form, dataset-based problems, of the type found here: https://pages.uoregon.edu/tpayne/problem_sets.htm. The reason why I want to focus on them, rather than on some other method of assessment, is because they can be effectively designed to mimic many aspects of a descriptive linguist’s work process (organising large-ish sets of data, identifying gaps, performing distributional analysis, testing different hypotheses, and - yes - reasoning abductively). Depending on various factors, “solving” a dataset-based problem can take a lot of time - and so are traditionally, in my experience, presented as “homework”. Doing so has generally been a good thing, because it gives students the time that most people need to really sit down and work through data, and therefore gives them/us the opportunity to learn concepts and develop skills through extended application and practice (typically more effective than a studied-for and timed exam). For decades already, there has been a sort of arms race against the internet, as good answers for most standard problem sets have long been posted (somewhere…), but through substitution/cloaking (“Language X” “some features may have been altered”...), and/or use of private/unpublished datasets, I trust that most people, like me, have been managing to cope. Until now, when it has become possible to use chatbots to develop solutions to problems for which no human-generated solution exists anywhere on the internet...

  1.
I’m asking specifically about their use as/in assessment, i.e. giving some students more points for a “better” solution than others, and having this contribute to their grade for the class. Leaving aside any pedagogical or ethical concerns around assessment, student performance ranking, etc., it seems to be a fact that - for better or for worse - most universities, and their government accrediting bodies, are going continue to demand this of their linguistics teachers. So the question is whether and how problem sets (or anything similar) can be used in/as assessments, in view of the following:

  1.
LLMs are simply getting much better at solving dataset-based problems of almost any type, and given the performance improvements over the past 12 months or so, I think we need to assume that they will keep getting better. Last year at this time ChatGPT was terrible at working with Salishan languages; you couldn’t even get it to find the verb root, it would just hack away at the stem phonology like a lunatic. Today, it’s not perfect, but it is definitely much, much better. The likely reasons for this are potentially interesting, but in a sense irrelevant. The question is whether there is any type of human-solvable dataset-based problem that an LLM must (by definition, in some sense) continue to find completely intractable, and I think we have to assume that the practical answer, as far as a linguistics teacher/class is concerned, can only be “no”. [And that’s leaving aside my earlier observation that we probably don’t want to be assessing students exclusively on their performance at analysing the small set of putatively “LLM-proof” languages that may for some reason exist in the world.]

  1.
The question in this context is not really whether LLMs can or can’t, as a matter of principle, do the work of distributional analysis in the same way as/as well as a well-trained human linguist. I have my own semi-informed opinions about this, but the question here is whether they can do it well enough to enable Student A, who outsourced their assignment completely to an LLM, to get a mark that is not much worse than that of Student B, who spent their entire weekend doing it by hand (yep - that is what we used to do…). Because if the answer is ever “yes” - and I would argue that, in most cases, it is currently “yes” - then this is no longer a useful or meaningful form of assessment (unless it is modified in some way, hence my question).
  2.

  3.
There are all kinds of cool and interesting tricks for detecting AI use in circulation, and I have indeed used some of them (most recently, text in background-coloured font which, while invisible most users, instructs an LLM to insert a flagword in its response - much like the “watermark” of Juergen’s suggestion). However, two things on this point: one, it’s just as unrealistic to expect a linguistics teacher to stay one step ahead of their students in the latest AI-detection techniques as it is to expect them to know the current shortlist of putatively LLM-proof languages (probably it is even less so). The field just moves too quickly, and even the best detection method can only be used once and will only catch out the least tech-aware among one’s students - which is not really the point. Two, my sense is that many university administrations are adopting an adversarial posture<https://educational-innovation.sydney.edu.au/teaching@sydney/false-flags-and-broken-trust-can-we-tell-if-ai-has-been-used/> towards AI-detection by teachers. Why this should be the case is an interesting (if depressing) topic to contemplate, but it at least implies that teachers should not assume AI-detection to fall within their reasonable (and safe…) range of responsibilities.

In sum: data sets as homework: pedagogically good in the absence of AI, unsound in the presence of AI. What can be done? If anything?

One approach which we’re planning to test out here (and I’d be interested to hear anyone’s opinions about, if they’ve tried anything similar - I realise that there are potentially some institution-specific policy/logistical issues involved) is to move problem set “homework” assessments into the tutorial/section/classroom/laboratory context, and have students work in groups to complete the problem set which they then submit at the end of the hour. There are some obvious pros and cons to this approach, but right now we’re more or less setting the low bar of, well - trying to get at least some students to do any sort of work at all.

Thanks again everyone
Mark

From: Juergen Bohnemeyer <jb77 at buffalo.edu>
Date: Friday, 7 November 2025 at 4:06 am
To: Mark Post <mark.post at sydney.edu.au>, typology list <lingtyp at listserv.linguistlist.org>
Subject: Re: "AI" and linguistics problem sets

Dear Mark — I’m actually surprised to hear that an AI bot is able to adequately solve your problem sets. My assumption, based on my own very limited experience with ChatGPT, has been that LMMs would perform so poorly at linguistic analysis that the results would dissuade students from trying again in the future. Would it be possible at all to share more details with us?

(One recommendation I have, which I however haven’t actually tried out, is to put a watermark of sorts in your assignments, in the form of a factual detail about some lesser-studied language. Even though such engines are of course quite capable of information retrieval, their very nature seems to predispose them toward predicting the answer rather than to looking it up. With the results being likely straightforwardly false.)

Best — Juergen

Juergen Bohnemeyer (He/Him)
Professor, Department of Linguistics
University at Buffalo

Office: 642 Baldy Hall, UB North Campus
Mailing address: 609 Baldy Hall, Buffalo, NY 14260
Phone: (716) 645 0127
Fax: (716) 645 3825
Email: jb77 at buffalo.edu<mailto:jb77 at buffalo.edu>
Web: http://www.acsu.buffalo.edu/~jb77/<https://url.au.m.mimecastprotect.com/s/O6TNCWLVXkUXv4zK7I6flCoL6Cb?domain=acsu.buffalo.edu/>

Office hours Tu/Th 3:30-4:30pm in 642 Baldy or via Zoom (Meeting ID 585 520 2411; Passcode Hoorheh)

There’s A Crack In Everything - That’s How The Light Gets In
(Leonard Cohen)

--

From: Lingtyp <lingtyp-bounces at listserv.linguistlist.org> on behalf of Mark Post via Lingtyp <lingtyp at listserv.linguistlist.org>
Date: Tuesday, November 4, 2025 at 18:27
To: typology list <lingtyp at listserv.linguistlist.org>
Subject: [Lingtyp] "AI" and linguistics problem sets

Dear Listmembers,

I trust that most lingtyp subscribers will have engaged with “problem sets” of the type found in Language Files, Describing Morphosyntax, and my personal favourite oldie-but-goodie the Source Book for Linguistics. Since the advent of ChatGPT, I’ve been migrating away from these (and even edited/obscured versions of them) for assessments, and relying more and more on private/unpublished data sets, mostly from languages with lots of complex morphology and less familiar category types, that LLMs seemed to have a much harder time with. This was not an ideal situation for many reasons, not least of which being that these were not the only types of languages students should get practice working with. But the problem really came to a head this year, when I found that perhaps most off-the-shelf LLMs were now able to solve almost all of my go-to problem sets to an at least reasonable degree, even after I obscured much of the data.

Leaving aside issues around how LLMs work, what role(s) they can or should (not) play in linguistic research, etc., I’d like to ask if any listmembers would be willing to share their experiences, advice, etc., specifically in the area of student assessment in the teaching of linguistic data analysis, and in particular morphosyntax, in the unfolding AI-saturated environment. Is the “problem set” method of teaching distributional analysis irretrievably lost? Can it still be employed, and if so how? Are there different/better ways of teaching more or less the same skills?

Note that I would really like to avoid doomsdayisms if possible here (“the skills traditionally taught to linguists have already been made obsolete by AIs, such that there’s no point in teaching them anymore” - an argument with which I am all-too-familiar), and focus, if possible, on how it is possible to assess/evaluate students’ performance under the assumption that there is at least some value in teaching at least some human beings how to do a distributional analysis “by hand” - such that they are actually able, for example, to evaluate a machine’s performance in analysing a new/unfamiliar data set, and under the further assumption that assessment/evaluation of student performance in at least many institutions will continue to follow existing models.

Many thanks in advance!
Mark

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listserv.linguistlist.org/pipermail/lingtyp/attachments/20251111/a09ca3a4/attachment.htm>