[Lingtyp] complex annotations and inter-rater reliability

Christian Lehmann christian.lehmann at uni-erfurt.de
Sun Jan 4 11:48:12 UTC 2026


Dear Björn,

I should have added that I am well aware that a description such as I 
postulated is not available for many languages. However, if this is so, 
requiring good annotations despite the absence of a complete description 
plus guidelines amounts to requiring that annotators do the work that 
the linguist employing them has not done.

Best,

Christian

------------------------------------------------------------------------------------------------------

On 04.01.2026 at 12:42, Wiemer, Bjoern wrote:
>
> Dear Christian,
>
> thanks for your suggestions. As a first reaction, I’d like to point 
> out two problems which you seem to skip over (or take for granted, 
> though they cannot be).
>
> First, your “if we ignore these [semantic and pragmatic factors] for a 
> moment”. This begs one big question. One reason is that you need 
> to understand how fine-grained (or coarse) your grid (value set) for a 
> given distinction can or should be.
>
> Second, you require a “complete linguistic description of the 
> language”. This looks much like a postulate from strict structuralism 
> that you have to know all elements and their relations to each other 
> (“où tout se tient”) before you may determine the meaning of a 
> particular item in an utterance. To my knowledge, this strict 
> postulate has never been met in reality (and it probably cannot be 
> fully met for any language). And how will you do this for historically 
> earlier stages, which more often than not are documented, let alone 
> described in structural terms, only fragmentarily?
>
> Taken to its logical conclusion, if we have to base our work strictly 
> on these principles, we are forced to say that we cannot do reliable 
> research on diachronic change (at least not of the kind in which 
> semantic and pragmatic functions occupy center stage)…
>
> Best,
>
> Björn.
>
> *From:* Lingtyp <lingtyp-bounces at listserv.linguistlist.org> *On behalf 
> of* Christian Lehmann via Lingtyp
> *Sent:* Sunday, 4 January 2026 12:13
> *To:* lingtyp at listserv.linguistlist.org
> *Subject:* Re: [Lingtyp] complex annotations and inter-rater reliability
>
> Dear Björn,
>
> I have never systematically studied the quality of the output of 
> different annotators, so please consider me incompetent in this 
> respect. However, any such study obviously presupposes a 
> definition of what a good/correct annotation is. Such a definition 
> would be possible under certain conditions:
>
>  1. The utterance to be annotated has one linguistic (phonological,
>     grammatical, semantic) structure. This implies that its meaning is
>     known and there is no (licit) variation of annotations reflecting
>     an ambiguity in the data.
>  2. There is a complete linguistic description of the language. Among
>     other things, it comprises lists of all linguistic units, the
>     regularities in their distribution and the set of constructions
>     that they form.
>  3. On the basis of this description, annotation guidelines are
>     formulated which provide a procedure by which the identity of a
>     unit found in an utterance is to be determined.
>  4. The annotation grid stipulates a representation for every
>     linguistic unit to be annotated.
>
> If all of this (unless I have forgotten something) could be made 
> formally explicit, then even an algorithm could produce a correct 
> annotation. It cannot be made fully explicit because of semantic and 
> pragmatic factors which cannot be systematized. Now if we ignore these 
> for a moment, then a given annotation is either correct or incorrect, 
> and the comparison of annotators’ output boils down to an examination 
> of whether their annotations are correct. Given this, it would seem to be 
> of secondary importance whether an annotator is a native speaker or a 
> linguist or what not; the only question is to what extent he or she 
> obeys the guidelines.
>
> The moral of my argument is: the burden is principally on the 
> shoulders of the person who formulates the guidelines. The annotator 
> can do no better than these.
>
> --------------------------------------------------
>
> On 03.01.2026 at 12:54, Wiemer, Bjoern via Lingtyp wrote:
>
>     Dear All,
>
>     since this seems to be the first post on this list this year, I
>     wish everybody a successful, more peaceful and decent year than
>     the previous one.
>
>     I want to raise an issue which goes back to a discussion from
>     October 2023 on this list (see the thread below, in reverse
>     chronological order). I’m interested to know whether anybody has a
>     satisfying answer to the question of how to deal with semantic
>     annotation, or the annotation of more complex (and less obvious)
>     relations, in particular with the annotation of interclausal
>     relations, both in terms of syntax and in semantic terms. Problems
>     arise already with the coordination-subordination gradient, which
>     ultimately is an outcome of a complex bunch of semantic criteria
>     (like independence of illocutionary force, perspective from which
>     referential expressions like tense or person deixis are
>     interpreted; see also the factors that were analyzed meticulously,
>     e.g., by Verstraete 2007). Other questions concern the coding of
>     clause-initial “particles”: are they just particles, operators of
>     “analytical mood”, or complementizers? (Notably, these things do
>     not exclude one another, but they heavily depend on one’s theory,
>     in particular one’s stance toward complementation and mood.)
>     Another case in point is the annotation of the functions and
>     properties of constructions in TAME-domains, especially if the
>     annotation grid is more fine-grained than mainstream categorizing.
>
>     The problems which I have encountered (in pilot studies) are very
>     similar to those discussed in October 2023 for seemingly even
>     “simpler”, or more coarse-grained, annotations. And they become
>     much worse when we turn to data from diachronic corpora: even if
>     being an informed native speaker is usually an asset, with
>     diachronic data this asset is often useless, and native knowledge
>     may even be a hindrance since it leads analysts to project their
>     habits and norms of contemporary usage onto earlier stages of the
>     “same” language. (Similar points apply to closely related
>     languages.) I entirely agree that annotators have to be trained,
>     and annotation grids tested, first of all because you have to
>     exclude the (very likely) possibility that raters disagree just
>     because some of the criteria are not clear to at least one of them
>     (with the consequence that you cannot know whether disagreement or
>     a low Kappa results from misunderstandings rather than reflecting
>     properties of your object of study). I also agree that each
>     criterion of a grid has to be sufficiently defined, and the
>     annotation grid (or even its “history”) as such be documented in
>     order to secure objective criteria for replicability and
>     comparability (for cross-linguistic research, but also for
>     diachronic studies based on a series of “synchronic cuts” of the
>     given language).
>
>     Against this background, I’d like to formulate the following questions:
>
>      1. Which arguments are there that (informed) native speakers are
>         better annotators than linguistically well-trained
>         students/linguists who are not native speakers of the
>         respective language(s), but can be considered experts?
>      2. Conversely, which arguments are there that non-native speaker
>         experts might be even better suited as annotators (for this or
>         that kind of issue)?
>      3. Have assumptions about pluses and minuses of both kinds of
>         annotators been tested in practice? That is, do we have
>         empirical evidence for any such assumptions (or do we just
>         rely on some sort of common sense, or on the personal
>         experience of those who have done more complicated annotation
>         work)?
>      4. How can pluses and minuses of both kinds of annotators be
>         counterbalanced in a way that does not consume too much time
>         (and money)?
>      5. What can we do with data from diachronic corpora if we have to
>         admit that (informed) native speakers are of no use, and
>         non-native experts are not acknowledged, either? Are we just
>         doomed to refrain from any reliable and valid in-depth
>         research based on annotations (and statistics) for
>         diachronically earlier stages and for diachronic change?
>      6. In connection with this, has any cross-linguistic research
>         that is interested in diachrony tried to implement insights
>         from fields such as historical semantics and pragmatics in
>         annotations? In typology, linguistic change has become
>         increasingly prominent during the last 10-15 years (not only
>         from a macro-perspective). I thus wonder whether typologists
>         have tried to “borrow” methodology from fields that have
>         possibly been better at interpreting diachronic data, and even
>         at quantifying them (to some extent).
>
>     I don’t want to be too pessimistic, but if we have no good answers
>     as to who should be doing annotations – informed native speakers
>     or non-native experts (or only those who are both native and
>     experts)? – and how we might be able to test the validity of
>     annotation grids (for comparisons across time and/or languages),
>     there won’t be convincing arguments for how to deal with diachronic
>     data (or data of lesser-studied languages for which there might be
>     no native speakers available) in empirical studies that are to
>     disclose more fine-grained distinctions and changes, also in order
>     to quantify them. In particular, reviewers of project applications
>     may always ask for a convincing methodology, and if no such
>     research is funded we’ll remain ignorant of a good many reasons and
>     backgrounds of language change.
>
>     I’d appreciate advice, in particular if it provides answers to any
>     of the questions under 1-6 above.
>
>     Best,
>
>     Björn (Wiemer).
>
-- 

Prof. em. Dr. Christian Lehmann
Rudolfstr. 4
99092 Erfurt
Deutschland

Tel.: 	+49/361/2113417
E-Post: 	christianw_lehmann at arcor.de
Web: 	https://www.christianlehmann.eu