[Lingtyp] complex annotations and inter-rater reliability
Wiemer, Bjoern
wiemerb at uni-mainz.de
Sun Jan 4 11:42:42 UTC 2026
Dear Christian,
thanks for your suggestions. As a first reaction, I’d like to point out two problems which you seem to skip over (or take for granted, although they cannot be taken for granted).
First, your “if we ignore these [semantic and pragmatic factors] for a moment”. This begs one big question, not least because you first need to understand how fine-grained (or how coarse) your grid (value set) for a given distinction can or should be.
Second, you require a “complete linguistic description of the language”. This looks much like a postulate of strict structuralism: that you have to know all elements and their relations to each other (“où tout se tient”) before you may determine the meaning of a particular item in an utterance. To my knowledge, this strict postulate has never been met in reality (and it probably cannot be fully met for any language). And how will you do this for historically earlier stages, which are more often than not documented (let alone described in structural terms) only fragmentarily?
Brought to its logical end: if we have to base our work strictly on these principles, we are forced to say that we cannot do reliable research on diachronic change (at least not research in which semantic and pragmatic functions occupy center stage)…
Best,
Björn.
From: Lingtyp <lingtyp-bounces at listserv.linguistlist.org> On behalf of Christian Lehmann via Lingtyp
Sent: Sunday, 4 January 2026 12:13
To: lingtyp at listserv.linguistlist.org
Subject: Re: [Lingtyp] complex annotations and inter-rater reliability
Dear Björn,
I have never systematically studied the quality of the product of different annotators, so please consider me incompetent in this respect. However, a presupposition of any such study is obviously a definition of what a good/correct annotation is. Such a definition would be possible under certain conditions:
1. The utterance to be annotated has one linguistic (phonological, grammatical, semantic) structure. This implies that its meaning is known and there is no (licit) variation of annotations reflecting an ambiguity in the data.
2. There is a complete linguistic description of the language. Among other things, it comprises lists of all linguistic units, the regularities in their distribution and the set of constructions that they form.
3. On the basis of this description, annotation guidelines are formulated which provide a procedure by which the identity of a unit found in an utterance is to be determined.
4. The annotation grid stipulates a representation for every linguistic unit to be annotated.
If all of this (unless I have forgotten anything) could be made formally explicit, then even an algorithm could produce a correct annotation. It cannot be made fully explicit, because of semantic and pragmatic factors which cannot be systematized. Now if we ignore these for a moment, then a given annotation is either correct or incorrect, and the comparison of the products of annotators boils down to an examination of whether their annotations are correct. Given this, it would seem to be of secondary importance whether an annotator is a native speaker or a linguist or anything else; the only question is to what extent he or she follows the guidelines.
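To make this conditional concrete, here is a minimal sketch (in Python) of what such an algorithmic procedure could look like, under the unrealistic assumption that conditions 1-4 hold for a toy unit inventory and grid; every unit, label and function name below is hypothetical and serves only to illustrate the reasoning, not to propose an actual tool.

# A minimal sketch, assuming fully explicit guidelines over a toy inventory.
# All units, labels and names are hypothetical.

# Condition 2: a (here trivially small) complete description, listing every
# unit together with the value the annotation grid assigns to it.
DESCRIPTION = {
    "que": "complementizer",
    "casa": "noun",
    "grande": "adjective",
}

def annotate(units):
    # Condition 3: the guidelines as a deterministic procedure over the description.
    return [DESCRIPTION[u] for u in units]

def guideline_conformity(annotator_output, units):
    # If the procedure is fully explicit, evaluating an annotator reduces to
    # checking whether each label matches the one the guidelines determine.
    gold = annotate(units)
    return sum(a == g for a, g in zip(annotator_output, gold)) / len(gold)

utterance = ["casa", "grande"]
print(guideline_conformity(["noun", "adjective"], utterance))  # 1.0: fully guideline-conform
print(guideline_conformity(["noun", "adverb"], utterance))     # 0.5: one deviation

On these assumptions, native-speaker status indeed plays no role: the score measures nothing but adherence to the guidelines.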
The moral of my argument is: the burden is principally on the shoulders of the person who formulates the guidelines. The annotator can do no better than these.
--------------------------------------------------
On 03.01.2026 at 12:54, Wiemer, Bjoern via Lingtyp wrote:
Dear All,
since this seems to be the first post on this list this year, I wish everybody a successful, more peaceful and decent year than the previous one.
I want to raise an issue which goes back to a discussion from October 2023 on this list (see the thread below, in reverse chronological order). I’m interested to know whether anybody has a satisfying answer to the question of how to deal with semantic annotation, or the annotation of more complex (and less obvious) relations, in particular the annotation of interclausal relations, both in syntactic and in semantic terms. Problems already arise with the coordination-subordination gradient, which is ultimately the outcome of a complex bundle of semantic criteria (such as independence of illocutionary force, or the perspective from which referential expressions like tense or person deixis are interpreted; see also the factors analyzed meticulously, e.g., by Verstraete 2007). Other questions concern the coding of clause-initial “particles”: are they just particles, operators of “analytical mood”, or complementizers? (Notably, these analyses do not exclude one another, but they heavily depend on one’s theory, in particular one’s stance toward complementation and mood.) Another case in point is the annotation of the functions and properties of constructions in TAME domains, especially if the annotation grid is more fine-grained than mainstream categorizations.
The problems which I have encountered (in pilot studies) are very similar to those discussed in October 2023 for seemingly even “simpler”, or more coarse-grained, annotations. And they are aggravated considerably when we turn to data from diachronic corpora: even if being an informed native speaker is usually an asset, with diachronic data this asset is often useless, and native knowledge may even be a hindrance, since it leads the analyst to project his or her habits and norms of contemporary usage onto earlier stages of the “same” language. (Similar points apply to closely related languages.) I entirely agree that annotators have to be trained, and annotation grids have to be tested, first of all because you have to exclude the (very likely) possibility that raters disagree simply because some of the criteria are not clear to at least one of them (with the consequence that you cannot know whether disagreement, or a low Kappa, results from misunderstandings rather than reflecting properties of your object of study; see the small sketch after this paragraph). I also agree that each criterion of a grid has to be sufficiently well defined, and that the annotation grid (or even its “history”) as such has to be documented, in order to preserve objective criteria for replicability and comparability (for cross-linguistic research, but also for diachronic studies based on a series of “synchronic cuts” of a given language).
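To make the Kappa point concrete, here is a minimal sketch (in Python) of how Cohen’s kappa is computed for two raters, and of how a single systematically misread criterion already deflates the value; the label set (“coord” vs. “subord”) and the ratings are invented purely for illustration.

from collections import Counter

def cohen_kappa(labels_a, labels_b):
    # Cohen's kappa for two annotators rating the same items.
    assert len(labels_a) == len(labels_b), "annotators must rate the same items"
    n = len(labels_a)
    # Observed agreement: proportion of items with identical labels.
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each annotator's marginal label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    categories = set(labels_a) | set(labels_b)
    p_expected = sum((freq_a[c] / n) * (freq_b[c] / n) for c in categories)
    return (p_observed - p_expected) / (1 - p_expected)

# Ten hypothetical clause linkages rated on a coarse two-value grid.
rater_1 = ["coord", "subord", "subord", "coord", "subord",
           "coord", "subord", "coord", "subord", "subord"]
# Rater 2 has misunderstood one criterion and shifts several items to "coord".
rater_2 = ["coord", "coord", "subord", "coord", "coord",
           "coord", "subord", "coord", "coord", "subord"]

print(round(cohen_kappa(rater_1, rater_2), 2))  # 0.44, despite 70% raw agreement

The point of the toy numbers is only this: the depressed kappa by itself cannot tell us whether rater 2 disagrees about the object of study or has simply misunderstood a guideline, which is why the grid has to be tested and documented first.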
Against this background, I’d like to formulate the following questions:
1. What arguments are there that (informed) native speakers are better annotators than linguistically well-trained students/linguists who are not native speakers of the respective language(s) but can be considered experts?
2. Conversely, what arguments are there that non-native expert speakers might be even better suited as annotators (for this or that kind of issue)?
3. Have assumptions about the pluses and minuses of both kinds of annotators been tested in practice? That is, do we have empirical evidence for any such assumptions (or do we just rely on some sort of common sense, or on the personal experience of those who have done more complicated annotation work)?
4. How can the pluses and minuses of both kinds of annotators be counterbalanced in a way that is not too time- (and money-)consuming?
5. What can we do with data from diachronic corpora if we have to admit that (informed) native speakers are of no use, and non-native experts are not acknowledged either? Are we simply doomed to refrain from any reliable and valid in-depth research based on annotations (and statistics) for diachronically earlier stages and for diachronic change?
6. In connection with this, has any cross-linguistic research interested in diachrony tried to implement insights from fields such as historical semantics and pragmatics into its annotations? In typology, linguistic change has become increasingly prominent during the last 10-15 years (not only from a macro-perspective). I thus wonder whether typologists have tried to “borrow” methodology from fields that have possibly been better at interpreting diachronic data, and even at quantifying it (to some extent).
I don’t want to be too pessimistic, but if we have no good answers as to who should be doing annotations – informed native speakers or non-native experts (or only those who are both native speakers and experts)? – and as to how we might test the validity of annotation grids (for comparisons across time and/or languages), there won’t be convincing arguments for how to deal with diachronic data (or with data from lesser-studied languages for which no native speakers may be available) in empirical studies that are meant to disclose more fine-grained distinctions and changes, also with a view to quantifying them. In particular, reviewers of project applications may always ask for a convincing methodology, and if no such research is funded we’ll remain ignorant of many of the reasons for, and backgrounds of, language change.
I’d appreciate advice, in particular if it provides answers to any of the questions under 1-6 above.
Best,
Björn (Wiemer).
--
Prof. em. Dr. Christian Lehmann
Rudolfstr. 4
99092 Erfurt
Germany
Tel.: +49/361/2113417
E-mail: christianw_lehmann at arcor.de
Web: https://www.christianlehmann.eu