[Lingtyp] complex annotations and inter-rater reliability

Martin Haspelmath martin_haspelmath at eva.mpg.de
Mon Jan 5 11:51:05 UTC 2026


Dear Björn,

Since you mentioned works on cross-linguistic inter-coder reliability as 
well (e.g. Himmelmann et al. 2018 on the universality of intonational 
phrases):

I think it's important to have clear and simple definitions of 
annotation categories, so if you are interested, for example, in "the 
coding of clause-initial “particles” (are they just particles, operators 
of “analytical mood”, or complementizers?)", you need to have clear and 
simple definitions of /particle/, /mood/, and /complementizer/ as 
comparative concepts. ("The burden is on those who formulate the 
guidelines", as Christian Lehmann said.)

I think one can define /particle/ as "a bound morph that is neither a 
root nor an affix nor a person form nor a linker", but this definition 
of course presupposes that one has a definition of "root", of "affix", 
and so on. These terms are not understood uniformly either, and /mood/ 
is perhaps the worst of all traditional terms (even worse than 
"subordination", I think).

Matters are quite different with materials from little-studied 
languages, i.e. with "transcribing and annotating recordings", as 
described by Jürgen Bohnemeyer. Language-particular descriptive 
categories are much easier to identify across texts than comparatively 
defined categories are across languages.

Best wishes for the New Year,

Martin

On 03.01.26 12:54, Wiemer, Bjoern via Lingtyp wrote:
>
> Dear All,
>
> Since this seems to be the first post on this list this year, I wish 
> everybody a successful year, more peaceful and decent than the 
> previous one.
>
> I want to raise an issue that goes back to a discussion from October 
> 2023 on this list (see the thread below, in reverse chronological 
> order). I’m interested to know whether anybody has a satisfying answer 
> to the question of how to deal with semantic annotation, or the 
> annotation of more complex (and less obvious) relations, in particular 
> the annotation of interclausal relations, both in syntactic and in 
> semantic terms. Problems already arise with the 
> coordination-subordination gradient, which is ultimately the outcome 
> of a complex bundle of semantic criteria (such as independence of 
> illocutionary force, or the perspective from which deictic categories 
> like tense or person are interpreted; see also the factors that were 
> meticulously analyzed, e.g., by Verstraete 2007). Other questions 
> concern the coding of clause-initial “particles”: are they just 
> particles, operators of “analytical mood”, or complementizers? 
> (Notably, these options do not exclude one another, but the answer 
> depends heavily on one’s theory, in particular one’s stance toward 
> complementation and mood.) Another case in point is the annotation of 
> the functions and properties of constructions in TAME domains, 
> especially if the annotation grid is more fine-grained than mainstream 
> categorizations.
>
> The problems which I have encountered (in pilot studies) are very 
> similar to those discussed in October 2023 for seemingly even 
> “simpler”, or more coarse-grained, annotations. And they become much 
> worse when we turn to data from diachronic corpora: even though being 
> an informed native speaker is usually an asset, with diachronic data 
> this asset is often useless, and native knowledge may even be a 
> hindrance, since it leads analysts to project their habits and norms 
> of contemporary usage onto earlier stages of the “same” language. 
> (Similar points apply to closely related languages.) I entirely agree 
> that annotators have to be trained and annotation grids have to be 
> tested, first of all because you have to exclude the (very likely) 
> possibility that raters disagree simply because some of the criteria 
> are not clear to at least one of them (with the consequence that you 
> cannot know whether disagreement, or a low kappa, results from 
> misunderstandings rather than from properties of your object of 
> study). I also agree that each criterion of a grid has to be 
> sufficiently well defined, and that the annotation grid (or even its 
> “history”) has to be documented, in order to preserve objective 
> criteria for replicability and comparability (for cross-linguistic 
> research, but also for diachronic studies based on a series of 
> “synchronic cuts” of the given language).
>
> Against this background, I’d like to formulate the following questions:
>
>  1. What arguments are there that (informed) native speakers are
>     better annotators than linguistically well-trained
>     students/linguists who are not native speakers of the respective
>     language(s) but can be considered experts?
>  2. Conversely, what arguments are there that non-native experts
>     might even be better suited as annotators (for this or that kind
>     of issue)?
>  3. Have assumptions about the pluses and minuses of both kinds of
>     annotators been tested in practice? That is, do we have empirical
>     evidence for any such assumptions (or do we just rely on some sort
>     of common sense, or on the personal experience of those who have
>     done more complicated annotation work)?
>  4. How can the pluses and minuses of both kinds of annotators be
>     counterbalanced in a way that is not too costly in time (and
>     money)?
>  5. What can we do with data from diachronic corpora if we have to
>     admit that (informed) native speakers are of no use, and
>     non-native experts are not acknowledged either? Are we simply
>     doomed to refrain from any reliable and valid in-depth research
>     based on annotations (and statistics) for diachronically earlier
>     stages and for diachronic change?
>  6. In connection with this, has any cross-linguistic research with
>     an interest in diachrony tried to implement insights from fields
>     such as historical semantics and pragmatics in its annotations?
>     In typology, linguistic change has become increasingly prominent
>     during the last 10-15 years (not only from a macro-perspective).
>     I thus wonder whether typologists have tried to “borrow”
>     methodology from fields that have possibly been better at
>     interpreting diachronic data, and even at quantifying it (to some
>     extent).
>
> I don’t want to be too pessimistic, but if we have no good answers as 
> to who should be doing annotations – informed native speakers or 
> non-native experts (or only those who are both native and experts)? – 
> and as to how we might test the validity of annotation grids (for 
> comparisons across time and/or languages), there won’t be convincing 
> arguments for how to deal with diachronic data (or data from 
> less-studied languages for which no native speakers may be available) 
> in empirical studies that are meant to reveal more fine-grained 
> distinctions and changes, and to quantify them. In particular, 
> reviewers of project applications may always ask for a convincing 
> methodology, and if no such research is funded, we’ll remain ignorant 
> of many of the reasons for and backgrounds of language change.
>
> I’d appreciate advice, in particular if it provides answers to any of 
> the questions under 1-6 above.
>
> Best,
>
> Björn (Wiemer).
>
> *From:* Lingtyp <lingtyp-bounces at listserv.linguistlist.org> *On 
> Behalf Of* William Croft
> *Sent:* Monday, 16 October 2023 15:52
> *To:* Volker Gast <volker.gast at uni-jena.de>
> *Cc:* LINGTYP at LISTSERV.LINGUISTLIST.ORG
> *Subject:* Re: [Lingtyp] typology projects that use inter-rater 
> reliability?
>
> An early cross-linguistic study with multiple annotators is this one:
>
> Gundel, Jeannette K., Nancy Hedberg & Ron Zacharski. 1993. Cognitive 
> status and the form of referring expressions in discourse. 
> /Language/ 69.274-307.
>
> It doesn’t have all the documentation that Volker suggests; our 
> standards for providing documentation have risen.
>
> I have been involved in annotation projects in natural language 
> processing, where the aim is to annotate corpora so that automated 
> methods can “learn” the annotation categories from the “gold standard” 
> (i.e. “expert”) annotation -- this is supervised learning in NLP. 
> Recent efforts are aiming at developing a single annotation scheme for 
> use across languages, such as Universal Dependencies (for syntactic 
> annotation), Uniform Meaning Representation (for semantic annotation), 
> and UniMorph (for morphological annotation). My experience is somewhat 
> similar to Volker’s: even when the annotation scheme is very 
> coarse-grained (from a theoretical linguist’s point of view), getting 
> good enough interannotator agreement is hard, even when the annotators 
> are the ones who designed the scheme, are native speakers, or have 
> done fieldwork on the language. I would add to Volker’s comments that 
> annotators have to be trained; but that training can introduce 
> (mostly implicit) biases, at least in the eyes of proponents of a 
> different theoretical approach -- something that is more apparent in a 
> field such as linguistics, where there are large differences in 
> theoretical approaches.
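>
> (As a side note, not from any of the projects mentioned above: with 
> more than two annotators, a commonly used chance-corrected measure is 
> Fleiss' kappa; here is a minimal sketch with invented ratings, using 
> Python's statsmodels:)
>
>     import numpy as np
>     from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa
>
>     # Five annotators each assign one of three category codes (0-2)
>     # to six items; rows are items, columns are annotators.
>     ratings = np.array([
>         [0, 0, 0, 1, 0],
>         [1, 1, 1, 1, 1],
>         [0, 2, 0, 0, 0],
>         [2, 2, 2, 2, 1],
>         [0, 0, 1, 0, 0],
>         [1, 1, 1, 2, 1],
>     ])
>
>     # Convert raw ratings to an items-by-categories count table,
>     # then compute Fleiss' kappa over it.
>     table, _ = aggregate_raters(ratings)
>     print(fleiss_kappa(table, method="fleiss"))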
>
> Bill
>
>
>
>     On Oct 16, 2023, at 6:02 AM, Volker Gast <volker.gast at uni-jena.de> wrote:
>
>
>     Hey Adam (and others),
>
>     I think you could phrase the question differently: What
>     typological studies have been carried out with multiple annotators
>     and careful documentation of the annotation process, including
>     precise annotation guidelines, the training of the annotators,
>     publication of all the (individual) annotations, calculation of
>     inter-annotator agreement etc.?
>
>     I think there are very few. The reason is that the process is very
>     time-consuming, and "risky". I co-directed a project with Vahram
>     Atayan (Heidelberg) in which we carried out
>     very careful annotations dealing with what we call 'adverbials of
>     immediate posteriority' (see the references below). Even though we
>     only dealt with a few well-known European languages, it took us
>     quite some time to develop annotation guidelines and train
>     annotators. The inter-rater agreement was surprisingly low even
>     for categories that appeared straightforward to us, e.g.
>     agentivity of a predicate; and we were dealing with well-known
>     languages (English, German, French, Spanish, Italian). So the
>     outcomes of this process were very moderate in comparison with the
>     work that went into the annotations. (Note that the project was
>     primarily situated in the field of contrastive linguistics and
>     translation studies, not linguistic typology, but the challenges
>     are the same.)
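>
>     (A toy illustration, with invented numbers, of one reason why
>     kappa can come out disappointingly low even when raw agreement
>     looks high: if one category dominates, the agreement expected by
>     chance is itself very high, and kappa penalizes that heavily:)
>
>         # 100 invented agentivity judgments: both raters choose
>         # "ag" for the same 90 items; they disagree on the other 10.
>         from sklearn.metrics import cohen_kappa_score
>
>         a = ["ag"] * 90 + ["ag"] * 5 + ["nonag"] * 5
>         b = ["ag"] * 90 + ["nonag"] * 5 + ["ag"] * 5
>
>         raw = sum(x == y for x, y in zip(a, b)) / len(a)
>         print(raw)                      # 0.90 raw agreement
>         print(cohen_kappa_score(a, b))  # roughly -0.05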
>
>     It's a dilemma: as a field, we often fail to meet even the most
>     basic methodological requirements that are standardly made in
>     other fields (most notably psychology). I know of at least two
>     typological projects where inter-rater agreement tests were run,
>     but the results were so poor that a decision was made to not
>     pursue this any further (meaning, the projects were continued, but
>     without inter-annotator agreement tests; that's what makes
>     annotation projects "risky": what do you do if you never reach a
>     satisfactory level of inter-annotator agreement?). Most annotation
>     projects, including some of my own earlier work, are based on what
>     we euphemistically call 'expert annotation', with 'expert'
>     referring to ourselves, the authors. Today I would minimally
>     expect the annotations to be done by someone who is not an author,
>     and I try to implement that requirement in my role as a journal
>     editor (Linguistics), but it's hard. We do want to see more
>     empirical work published, and if the methodological standards are
>     set too high, we will end up publishing nothing at all.
>
>     I'd be very happy if there were community standards for this, and
>     I'd like to hear about any initiatives implementing more rigorous
>     methodological standards in linguistic typology. Honestly, I
>     wouldn't know what to require. But it seems clear to me that we
>     cannot simply go on like this, annotating our own data, which we
>     subsequently analyze, as it is well known that annotation
>     decisions are influenced by (mostly implicit) biases.
>
>     Best,
>     Volker
>
>     Gast, Volker & Vahram Atayan (2019). 'Adverbials of immediate
>     posteriority in French and German: A contrastive corpus study of
>     tout de suite, immédiatement, gleich and sofort'. In Emonds, J.,
>     M. Janebová & L. Veselovská (eds.): Language Use and Linguistic
>     Structure. Proceedings of the Olomouc Linguistics Colloquium 2018,
>     403-430. Olomouc Modern Language Series. Olomouc: Palacký
>     University Olomouc.
>
>     in German:
>
>     Atayan, V., B. Fetzer, V. Gast, D. Möller, T. Ronalter (2019).
>     'Ausdrucksformen der unmittelbaren Nachzeitigkeit in Originalen
>     und Übersetzungen: Eine Pilotstudie zu den deutschen Adverbien
>     gleich und sofort und ihren Äquivalenten im Französischen,
>     Italienischen, Spanischen und Englischen'. In Ahrens, B., S.
>     Hansen-Schirra, M. Krein-Kühle, M. Schreiber, U. Wienen (eds.):
>     Translation -- Linguistik -- Semiotik, 11-82. Berlin: Frank & Timme.
>
>     Gast, V., V. Atayan, J. Biege, B. Fetzer, S. Hettrich, A. Weber
>     (2019). 'Unmittelbare Nachzeitigkeit im Deutschen und
>     Französischen: Eine Studie auf Grundlage des
>     OpenSubtitles-Korpus'. In Konecny, C., C. Konzett, E. Lavric, W.
>     Pöckl (eds.): Comparatio delectat III. Akten der VIII.
>     Internationalen Arbeitstagung zum romanisch-deutschen und
>     innerromanischen Sprachvergleich, 223-249. Frankfurt: Lang.
>
>
>     ---
>     Prof. V. Gast
>     https://linktype.iaa.uni-jena.de/VG
>
>     On Sat, 14 Oct 2023, Adam James Ross Tallman wrote:
>
>
>         Hello all,
>
>         I am gathering a list of projects / citations / papers that
>         use or refer to inter-rater reliability. So far I have:
>
>         Himmelmann et al. 2018. On the universality of intonational
>         phrases: a cross-linguistic interrater study. Phonology 35.
>
>         Gast & Koptjevskaja-Tamm. 2022. Patterns of persistence and
>         diffusibility in the European lexicon. Linguistic Typology.
>         (Not explicitly the topic of the paper, but interrater
>         reliability metrics are used.)
>
>         I understand people working with Grambank have used it, but I
>         don't know if there is a publication on that.
>         best,
>         Adam
>         --
>         Adam J.R. Tallman
>         Post-doctoral Researcher
>         Friedrich Schiller Universität
>         Department of English Studies
>
>
>
> _______________________________________________
> Lingtyp mailing list
> Lingtyp at listserv.linguistlist.org
> https://listserv.linguistlist.org/cgi-bin/mailman/listinfo/lingtyp

-- 
Martin Haspelmath
Max Planck Institute for Evolutionary Anthropology
Deutscher Platz 6
D-04103 Leipzig
https://www.eva.mpg.de/linguistic-and-cultural-evolution/staff/martin-haspelmath/

