[Lingtyp] complex annotations and inter-rater reliability
Martin Haspelmath
martin_haspelmath at eva.mpg.de
Mon Jan 5 11:51:05 UTC 2026
Dear Björn,
Since you mentioned works on cross-linguistic inter-coder reliability as
well (e.g. Himmelmann et al. 2018 on the universality of intonational
phrases):
I think it's important to have clear and simple definitions of
annotation categories, so if you are interested, for example, in "the
coding of clause-initial “particles” (are they just particles, operators
of “analytical mood”, or complementizers?)", you need to have clear and
simple definitions of /particle/, /mood/, and /complementizer/ as
comparative concepts. ("The burden is on those who formulate the
guidelines", as Christian Lehmann said.)
I think one can define /particle/ as "a bound morph that is neither a
root nor an affix nor a person form nor a linker", but this definition
of course presupposes that one has a definition of "root", of "affix",
and so on. These terms are not understood uniformly either, and /mood/
is perhaps the worst of all traditional terms (even worse than
"subordination", I think).
Matters are quite different with materials from little-studied
languages, i.e. with "transcribing and annotating recordings", as
described by Jürgen Bohnemeyer. Language-particular descriptive
categories are much easier to identify across texts than comparatively
defined categories are across languages.
Best wishes for the New Year,
Martin
On 03.01.26 12:54, Wiemer, Bjoern via Lingtyp wrote:
>
> Dear All,
>
> since this seems to be the first post on this list this year, I wish
> everybody a successful, more peaceful and decent year than the
> previous one.
>
> I want to raise an issue which goes back to a discussion from October
> 2023 on this list (see the thread below, in reverse chronological
> order). I’d like to know whether anybody has a satisfying answer to
> the question of how to deal with semantic annotation, or with the
> annotation of more complex (and less obvious) relations, in
> particular the annotation of interclausal relations, in both
> syntactic and semantic terms. Problems already arise with the
> coordination-subordination gradient, which is ultimately the outcome
> of a complex bundle of semantic criteria (such as independence of
> illocutionary force, or the perspective from which deictic categories
> like tense and person are interpreted; see also the factors that were
> meticulously analyzed by Verstraete 2007). Other questions
> concern the coding of clause-initial “particles”: are they just
> particles, operators of “analytical mood”, or complementizers?
> (Notably, these analyses do not exclude one another, but they depend
> heavily on one’s theory, in particular one’s stance toward
> complementation and mood.) Another case in point is the annotation of
> the functions and properties of constructions in TAME domains,
> especially if the annotation grid is more fine-grained than mainstream
> categorization.
>
> The problems which I have encountered (in pilot studies) are very
> similar to those discussed in October 2023 for seemingly even
> “simpler”, or more coarse-grained, annotations. And they become
> considerably worse when we turn to data from diachronic corpora: even
> though being an informed native speaker is usually an asset, with
> diachronic data this asset is often useless, and native knowledge may
> even be a hindrance, since it leads analysts to project their habits
> and norms of contemporary usage onto earlier stages of the “same”
> language. (Similar points apply to closely related languages.) I
> entirely agree that annotators have to be trained, and that
> annotation grids have to be tested, first of all because you have to
> exclude the (very likely) possibility that raters disagree just
> because some of the criteria are not clear to at least one of them
> (with the consequence that you cannot know whether disagreement, or a
> low kappa, results from misunderstandings rather than from properties
> of your object of study). I also agree that each criterion of a grid
> has to be sufficiently well defined, and that the annotation grid (or
> even its “history”) has to be documented, in order to preserve
> objective criteria for replicability and comparability (for
> cross-linguistic research, but also for diachronic studies based on a
> series of “synchronic cuts” of the given language).
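>
> To make concrete what such testing involves: Cohen’s kappa corrects
> raw agreement for the agreement expected by chance. A minimal sketch
> (the labels and figures below are invented for illustration, not
> taken from any actual pilot study):
>
>     from collections import Counter
>
>     def cohens_kappa(rater_a, rater_b):
>         # Chance-corrected agreement between two annotators.
>         n = len(rater_a)
>         p_o = sum(x == y for x, y in zip(rater_a, rater_b)) / n
>         freq_a, freq_b = Counter(rater_a), Counter(rater_b)
>         # Agreement expected by chance, from each rater's label
>         # frequencies.
>         p_e = sum((freq_a[lab] / n) * (freq_b[lab] / n)
>                   for lab in set(freq_a) | set(freq_b))
>         return (p_o - p_e) / (1 - p_e)
>
>     # Two raters classify ten clauses as subordinate vs. coordinate
>     # and agree on eight; kappa is far below the raw 80%, because
>     # most of that agreement is expected by chance anyway.
>     a = ["sub", "sub", "coord", "sub", "sub",
>          "sub", "coord", "sub", "sub", "sub"]
>     b = ["sub", "sub", "coord", "sub", "sub",
>          "sub", "sub", "coord", "sub", "sub"]
>     print(cohens_kappa(a, b))  # ~0.375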
>
> Against this background, I’d like to formulate the following questions:
>
> 1. What arguments are there that (informed) native speakers are
> better annotators than linguistically well-trained
> students/linguists who are not native speakers of the respective
> language(s), but can be considered experts?
> 2. Conversely, what arguments are there that non-native-speaker
> experts might be even better suited as annotators (for this or
> that kind of issue)?
> 3. Have assumptions about pluses and minuses of both kinds of
> annotators been tested in practice? That is, do we have empirical
> evidence for any such assumptions (or do we just rely on some sort
> of common sense, or on the personal experience of those who have
> done more complicated annotation work)?
> 4. How can the pluses and minuses of both kinds of annotators be
> counterbalanced in a way that is not too time- (and money-)consuming?
> 5. What can we do with data from diachronic corpora if we have to
> admit that (informed) native speakers are of no use, and
> non-native experts are not acknowledged, either? Are we simply
> doomed to refrain from any reliable and valid in-depth research
> based on annotations (and statistics) for diachronically earlier
> stages and for diachronic change?
> 6. In connection with this, has any cross-linguistic research that is
> interested in diachrony tried to incorporate insights from fields
> such as historical semantics and pragmatics into annotation?
> In typology, linguistic change has become increasingly prominent
> during the last 10-15 years (not only from a macro-perspective).
> I thus wonder whether typologists have tried to “borrow”
> methodology from fields that have possibly been better at
> interpreting diachronic data, and even at quantifying it (to some
> extent).
>
> I don’t want to be too pessimistic, but if we have no good answers as
> to who should be doing annotations – informed native speakers or
> non-native experts (or only those who are both native and experts)? –
> and as to how we might test the validity of annotation grids (for
> comparisons across time and/or languages), there won’t be convincing
> arguments for how to deal with diachronic data (or data from
> less-studied languages for which no native speakers may be available)
> in empirical studies that aim to reveal more fine-grained distinctions
> and changes, and also to quantify them. In particular, reviewers
> of project applications may always ask for a convincing methodology,
> and if no such research is funded, we’ll remain ignorant of many of
> the causes and circumstances of language change.
>
> I’d appreciate advice, in particular if it provides answers to any of
> the questions under 1-6 above.
>
> Best,
>
> Björn (Wiemer).
>
> *From:* Lingtyp <lingtyp-bounces at listserv.linguistlist.org> *On behalf
> of* William Croft
> *Sent:* Monday, 16 October 2023 15:52
> *To:* Volker Gast <volker.gast at uni-jena.de>
> *Cc:* LINGTYP at LISTSERV.LINGUISTLIST.ORG
> *Subject:* Re: [Lingtyp] typology projects that use inter-rater
> reliability?
>
> An early cross-linguistic study with multiple annotators is this one:
>
> Gundel, Jeannette K., Nancy Hedberg & Ron Zacharski. 1993. Cognitive
> status and the form of referring expressions in discourse.
> /Language/ 69.274-307.
>
> It doesn’t have all the documentation that Volker suggests; our
> standards for providing documentation have risen.
>
> I have been involved in annotation projects in natural language
> processing, where the aim is to annotate corpora so that automated
> methods can “learn” the annotation categories from the “gold standard”
> (i.e. “expert”) annotation -- this is supervised learning in NLP.
> Recent efforts are aiming at developing a single annotation scheme for
> use across languages, such as Universal Dependencies (for syntactic
> annotation), Uniform Meaning Representation (for semantic annotation),
> and UniMorph (for morphological annotation). My experience is somewhat
> similar to Volker’s: even when the annotation scheme is very
> coarse-grained (from a theoretical linguist’s point of view), getting
> good enough interannotator agreement is hard, even when the annotators
> are the ones who designed the scheme, or are native speakers or have
> done fieldwork on the language. I would add to Volker’s comments that
> one has to be trained for annotation; but that training can introduce
> (mostly implicit) biases, at least in the eyes of proponents of a
> different theoretical approach -- something that is more apparent in a
> field such as linguistics where there are large differences in
> theoretical approaches.
>
> Bill
>
>
>
> On Oct 16, 2023, at 6:02 AM, Volker Gast <volker.gast at uni-jena.de> wrote:
>
>
> Hey Adam (and others),
>
> I think you could phrase the question differently: What
> typological studies have been carried out with multiple annotators
> and careful documentation of the annotation process, including
> precise annotation guidelines, the training of the annotators,
> publication of all the (individual) annotations, calculation of
> inter-annotator agreement etc.?
>
> I think there are very few. The reason is that the process is very
> time-consuming, and "risky". I was part of a project, co-directed
> with Vahram Atayan (Heidelberg), in which we carried out very
> careful annotations dealing with what we call 'adverbials of
> immediate posteriority' (see the references below). Even though we
> only dealt with a few well-known European languages, it took us
> quite some time to develop annotation guidelines and train
> annotators. The inter-rater agreement was surprisingly low even
> for categories that appeared straightforward to us, e.g.
> agentivity of a predicate; and we were dealing with well-known
> languages (English, German, French, Spanish, Italian). So the
> outcomes of this process were very modest in comparison with the
> work that went into the annotations. (Note that the project was
> primarily situated in the field of contrastive linguistics and
> translation studies, not linguistic typology, but the challenges
> are the same.)
>
> It's a dilemma: as a field, we often fail to meet even the most
> basic methodological requirements that are standardly made in
> other fields (most notably psychology). I know of at least two
> typological projects where inter-rater agreement tests were run,
> but the results were so poor that a decision was made to not
> pursue this any further (meaning, the projects were continued, but
> without inter-annotator agreement tests; that's what makes
> annotation projects "risky": what do you do if you never reach a
> satisfactory level of inter-annotator agreement?). Most annotation
> projects, including some of my own earlier work, are based on what
> we euphemistically call 'expert annotation', with 'expert'
> referring to ourselves, the authors. Today I would minimally
> expect the annotations to be done by someone who is not an author,
> and I try to implement that requirement in my role as a journal
> editor (Linguistics), but it's hard. We do want to see more
> empirical work published, and if the methodological standards are
> too high, we will end up publishing nothing at all.
>
> I'd be very happy if there were community standards for this, and
> I'd like to hear about any initiatives implementing more rigorous
> methodological standards in linguistic typology. Honestly, I
> wouldn't know what to require. But it seems clear to me that we
> cannot simply go on like this, annotating our own data, which we
> subsequently analyze, as it is well known that annotation
> decisions are influenced by (mostly implicit) biases.
>
> Best,
> Volker
>
> Gast, Volker & Vahram Atayan (2019). 'Adverbials of immediate
> posteriority in French and German: A contrastive corpus study of
> tout de suite, immédiatement, gleich and sofort'. In Emonds, J.,
> M. Janebová & L. Veselovská (eds.): Language Use and Linguistic
> Structure. Proceedings of the Olomouc Linguistics Colloquium 2018,
> 403-430. Olomouc Modern Language Series. Olomouc: Palacký
> University Olomouc.
>
> in German:
>
> Atayan, V., B. Fetzer, V. Gast, D. Möller, T. Ronalter (2019).
> 'Ausdrucksformen der unmittelbaren Nachzeitigkeit in Originalen
> und Übersetzungen: Eine Pilotstudie zu den deutschen Adverbien
> gleich und sofort und ihren Äquivalenten im Französischen,
> Italienischen, Spanischen und Englischen'. In Ahrens, B., S.
> Hansen-Schirra, M. Krein-Kühle, M. Schreiber, U. Wienen (eds.):
> Translation -- Linguistik -- Semiotik, 11-82. Berlin: Frank & Timme.
>
> Gast, V., V. Atayan, J. Biege, B. Fetzer, S. Hettrich, A. Weber
> (2019). 'Unmittelbare Nachzeitigkeit im Deutschen und
> Französischen: Eine Studie auf Grundlage des
> OpenSubtitles-Korpus'. In Konecny, C., C. Konzett, E. Lavric, W.
> Pöckl (eds.): Comparatio delectat III. Akten der VIII.
> Internationalen Arbeitstagung zum romanisch-deutschen und
> innerromanischen Sprachvergleich, 223-249. Frankfurt: Lang.
>
>
> ---
> Prof. V. Gast
> https://linktype.iaa.uni-jena.de/VG
>
> On Sat, 14 Oct 2023, Adam James Ross Tallman wrote:
>
>
> Hello all,
> I am gathering a list of projects / citations / papers that
> use or refer to inter-rater reliability. So far I have:
> Himmelmann et al. 2018. On the universality of intonational
> phrases: a cross-linguistic interrater study. Phonology 35.
> Gast & Koptjevskaja-Tamm. 2022. Patterns of persistence and
> diffusibility in the European lexicon. Linguistic Typology
> (not explicitly the topic of the paper, but interrater
> reliability metrics are used)
> I understand people working with Grambank have used it, but I
> don't know if there is a publication on that.
> best,
> Adam
> --
> Adam J.R. Tallman
> Post-doctoral Researcher
> Friedrich Schiller Universität
> Department of English Studies
>
> _______________________________________________
> Lingtyp mailing list
> Lingtyp at listserv.linguistlist.org
> https://listserv.linguistlist.org/cgi-bin/mailman/listinfo/lingtyp
--
Martin Haspelmath
Max Planck Institute for Evolutionary Anthropology
Deutscher Platz 6
D-04103 Leipzig
https://www.eva.mpg.de/linguistic-and-cultural-evolution/staff/martin-haspelmath/