[Lingtyp] complex annotations and inter-rater reliability

Aleksandrs Berdicevskis alexberd at gmail.com
Mon Jan 5 18:20:08 UTC 2026


Dear Björn,

some very incomplete (and not entirely direct) answers to some of your
questions:

>> 6. In connection with this, has any cross-linguistic research that is
interested in diachrony tried to implement insights from such fields as
historical semantics and pragmatics into annotations? In typology,
linguistic change has become increasingly prominent during the last
10-15 years (not only from a macro-perspective). I thus wonder whether
typologists have tried to “borrow” methodology from fields that have
possibly been better at interpreting diachronic data, and even at
quantifying them (to some extent).

In computational linguistics, there has recently been a surge of interest
in *lexical* semantic change. Put very roughly, most studies follow the
same workflow: given a research question, manually annotate a dataset; use
the dataset to train a model that solves a relevant task (e.g. detects
change); evaluate the performance of the model on a held-out part of the
dataset; and, depending on the evaluation results, decide whether the model
is hopeless or can be used instead of (or in addition to) human annotators
for large-scale quantitative studies. Most studies focus on one
language (Schlechtweg et al. 2025
<https://link.springer.com/article/10.1007/s10579-024-09771-7>, McGillivray
et al. 2022
<https://www.degruyterbrill.com/document/doi/10.1515/joll-2022-2007/html>),
some include several (Schlechtweg et al. 2020
<https://aclanthology.org/2020.semeval-1.1/>), see also an overview by
McGillivray (2020
<https://www.taylorfrancis.com/chapters/edit/10.4324/9780429777028-20/computational-methods-semantic-analysis-historical-texts-barbara-mcgillivray>).
I am not aware of any truly typological studies, but the underlying
assumption is usually that the computational methods are universal: if a
model solves a particular task well, it can be trained on a dataset for any
stage of any language.
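
Purely as an illustration of that workflow (not of any particular study
cited above), here is a minimal Python sketch of the "train on annotated
data, evaluate on a held-out part" step; the file name, column names and
choice of classifier are all made up for the example.

# Sketch: manually annotate -> train -> evaluate on held-out data -> decide.
# "annotated_contexts.csv" and its columns are hypothetical.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# One manually annotated context per row, e.g. a usage of a target word
# labelled with its sense.
data = pd.read_csv("annotated_contexts.csv")

# Hold out part of the annotated data for evaluation.
train, test = train_test_split(data, test_size=0.2, random_state=0)

# A deliberately simple model: bag-of-words features + logistic regression.
vectorizer = TfidfVectorizer()
model = LogisticRegression(max_iter=1000)
model.fit(vectorizer.fit_transform(train["context"]), train["label"])

# If the held-out scores are acceptable, the model can replace (or assist)
# the human annotators on the unannotated bulk of the corpus; if not, it is
# back to the drawing board.
print(classification_report(test["label"],
                            model.predict(vectorizer.transform(test["context"]))))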

I'd say that this field is an example of (somewhat) cross-linguistic
research that is interested in diachrony and that is trying to implement
insights from historical semantics, and that's why I mention it. However,
it is probably not directly relevant for you, since your questions, if I
understand correctly, are more about grammatical semantics and syntax, and
then perhaps *treebanking* is a more relevant subfield of computational
linguistics. In order to provide a sentence with a syntactic tree and every
word with a part-of-speech/morphological label, it is necessary to address
questions like those you list: is this a particle or a complementizer? Is
clause A subordinate to clause B or coordinated with it? [On a side note:
in my experience, however, systematic measurements of inter-annotator
agreement are rare in the corpus/treebank field, partly because they are
very expensive (it always feels like it's better to annotate 100 new
sentences than to reannotate the same 100 sentences), partly because (I
suspect) the results are likely to be depressing.] Given this view of
treebank development, the whole Universal
Dependencies <https://universaldependencies.org/> project is a
cross-linguistic research direction which is interested in diachrony (since
there are quite a few historical treebanks in the collection, and more are
coming) and which is certainly trying to implement insights from historical
semantics and pragmatics (to what extent may vary depending on the
treebank).

>> 5. What can we do with data from diachronic corpora if we have to admit
that (informed) native speakers are of no use, and non-native experts are
not acknowledged, either? Are we just doomed to refrain from any reliable
and valid in-depth research based on annotations (and statistics) for
diachronically earlier stages and for diachronic change?

A very optimistic computational linguist would say that if semantic
information can be inferred from context (and we can test whether it can be
by constructing and evaluating automatic tools for living languages, where
native speakers can provide us with some kind of ground truth), then a
language model should be able to annotate diachronic corpora, provided that
there is enough data (in other words, become a "native speaker" of a given
language stage). I dare not say whether this is true in the general case.
In any case, I would not be so pessimistic as to say that no quantitative
diachronic research is possible. As in other fields, we'd have to come up
with ways of finding converging evidence from various sources to test our
hypotheses.

>> 1. Which arguments are there that (informed) native speakers are better
annotators than linguistically well-trained students/linguists who are not
native speakers of the respective language(s), but can be considered
experts?

From my experience of annotating Old East Slavic, Old Church Slavonic,
Russian and Swedish treebanks, I'd say that being a native speaker (of the
modern language) or not does not play a decisive role: it's the knowledge
of the particular language (and relevant scholarly literature and
traditions) that matters. Despite being a native speaker of Russian, I felt
that I performed worse on OES than non-native speakers who had more
extensive training in historical Slavic linguistics. It did happen that my
native intuition gave me some unique insights, or, vice versa, confounded
me (as you suggest it may), but those cases weren't frequent. My answer to
your questions 1-2 (an intuitive answer based on anecdotal evidence!) is
thus: it does not really matter. Note, however, that while treebank
annotators do have to solve problems like those you list, they cannot
afford to go into each single problem too deeply (is this subordination or
coordination?); we/they have to sacrifice some depth for breadth. If you
are interested in
deeper, fine-grained and more semantic analysis, the answer may be
different (though my guess would still be that it isn't).

In response to your introductory paragraphs: as William Croft mentioned in
the previous thread, in computational linguistics in general, a lot of
attention has been paid to inter-annotator agreement: how to measure it,
how to improve it and what to do when it cannot be improved (resolve every
single disagreement? Cast away all problematic cases? Somehow aggregate
labels from several annotators and use an "average" one? Embrace the
variation and accept that some datapoints will have more than one label?).
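
For concreteness, here is a toy Python sketch of two of these options:
measuring pairwise agreement with Cohen's kappa and aggregating labels by
majority vote. The labels and annotators are invented for the example.

# Toy example: three annotators label the same five clauses as
# subordinate ("sub") or coordinate ("coord").
from collections import Counter
from sklearn.metrics import cohen_kappa_score

ann_a = ["sub", "coord", "sub", "sub", "coord"]
ann_b = ["sub", "coord", "coord", "sub", "coord"]
ann_c = ["sub", "sub", "sub", "sub", "coord"]

# Chance-corrected pairwise agreement (Cohen's kappa).
print(cohen_kappa_score(ann_a, ann_b))

# One possible aggregation: majority vote per item; ties would have to be
# adjudicated, or kept as multiple labels ("embrace the variation").
aggregated = [Counter(item).most_common(1)[0][0]
              for item in zip(ann_a, ann_b, ann_c)]
print(aggregated)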

Best regards,
Sasha

---
Aleksandrs Berdicevskis
Researcher, Associate professor
Språkbanken Text
Department of Swedish, Multilingualism, Language Technology
University of Gothenburg

On Mon, 5 Jan 2026 at 12:51, Martin Haspelmath via Lingtyp <
lingtyp at listserv.linguistlist.org> wrote:

> Dear Björn,
>
> Since you mentioned works on cross-linguistic inter-coder reliability as
> well (e.g. Himmelmann et al. 2018 on the universality of intonational
> phrases):
>
> I think it's important to have clear and simple definitions of annotation
> categories, so if you are interested, for example, in "the coding of
> clause-initial “particles” (are they just particles, operators of
> “analytical mood”, or complementizers?)", you need to have clear and simple
> definitions of *particle*, *mood*, and *complementizer* as comparative
> concepts. ("The burden is on those who formulate the guidelines", as
> Christian Lehmann said.)
>
> I think one can define *particle* as "a bound morph that is neither a
> root nor an affix nor a person form nor a linker", but this definition of
> course presupposes that one has a definition of "root", of "affix", and so
> on. These terms are not understood uniformly either, and *mood* is
> perhaps the worst of all traditional terms (even worse than
> "subordination", I think).
>
> Matters are quite different with materials from little-studied languages,
> i.e. with "transcribing and annotating recordings", as described by
> Jürgen Bohnemeyer. Language-particular descriptive categories are much
> easier to identify across texts than comparatively defined categories are
> to identify across languages.
>
> Best wishes for the New Year,
>
> Martin
> On 03.01.26 12:54, Wiemer, Bjoern via Lingtyp wrote:
>
> Dear All,
>
> since this seems to be the first post on this list this year, I wish
> everybody a successful, more peaceful and decent year than the previous one.
>
>
>
> I want to raise an issue which gets back to a discussion from October 2023
> on this list (see the thread below, in inverse chronological order). I’m
> interested to know whether anybody has a satisfying answer to the question
> of how to deal with semantic annotation, or the annotation of more complex
> (and less obvious) relations, in particular with the annotation of
> interclausal relations, both in terms of syntax and in semantic terms.
> Problems arise already with the coordination-subordination gradient, which
> ultimately is an outcome of a complex set of semantic criteria (like
> independence of illocutionary force, perspective from which referential
> expressions like tense or person deixis are interpreted; see also the
> factors that were analyzed meticulously, e.g., by Verstraete 2007). Other
> questions concern the coding of clause-initial “particles”: are they just
> particles, operators of “analytical mood”, or complementizers? (Notably,
> these things do not exclude one another, but they heavily depend on one’s
> theory, in particular one’s stance toward complementation and mood.)
> Another case in point is the annotation of the functions and properties of
> constructions in TAME-domains, especially if the annotation grid is more
> fine-grained than mainstream categorizing.
>
>                 The problems which I have encountered (in pilot studies)
> are very similar to those discussed in October 2023 for seemingly even
> “simpler”, or more coarse-grained annotations. And they aggravate a lot
> when we turn to data from diachronic corpora: even if being an informed
> native speaker is usually an asset, with diachronic data this asset is
> often useless, and native knowledge may be even a hindrance since it leads
> the analyst to project one’s habits and norms of contemporary usage to
> earlier stages of the “same” language. (Similar points apply for closely
> related languages.) I entirely agree that annotators have to be trained,
> and grids of annotation to be tested, first of all because you have to
> exclude the (very likely) possibility that raters disagree just because
> some of the criteria are not clear to at least one of them (with the
> consequence that you cannot know whether disagreement or a low kappa
> results from misunderstandings rather than reflecting properties of your
> object of study). I also agree that each criterion of a grid has to be
> sufficiently defined, and the annotation grid (or even its “history”) as
> such be documented in order to preserve objective criteria for replicability
> and comparability (for cross-linguistic research, but also for diachronic
> studies based on a series of “synchronic cuts” of the given language).
>
>
>
> Against this background, I’d like to formulate the following questions:
>
>    1. Which arguments are there that (informed) native speakers are
>    better annotators than linguistically well-trained students/linguists who
>    are not native speakers of the respective language(s), but can be
>    considered experts?
>    2. Conversely, which arguments are there that non-native speaker
>    experts might be even better suited as annotators (for this or that kind of
>    issue)?
>    3. Have assumptions about pluses and minuses of both kinds of
>    annotators been tested in practice? That is, do we have empirical evidence
>    for any such assumptions (or do we just rely on some sort of common sense,
>    or on the personal experience of those who have done more complicated
>    annotation work)?
>    4. How can pluses and minuses of both kinds of annotators be
>    counterbalanced in a not too time (and money) consuming way?
>    5. What can we do with data from diachronic corpora if we have to
>    admit that (informed) native speakers are of no use, and non-native experts
>    are not acknowledged, either? Are we just doomed to refrain from any
>    reliable and valid in-depth research based on annotations (and statistics)
>    for diachronically earlier stages and for diachronic change?
>    6. In connection with this, has any cross-linguistic research that is
>    interested in diachrony tried to implement insights from such fields as
>    historical semantics and pragmatics into annotations? In typology,
>    linguistic change has become increasingly prominent during the last
>    10-15 years (not only from a macro-perspective). I thus wonder whether
>    typologists have tried to “borrow” methodology from fields that have
>    possibly been better at interpreting diachronic data, and even at
>    quantifying them (to some extent).
>
>
>
> I don’t want to be too pessimistic, but if we have no good answers as to
> who should be doing annotations – informed native speakers or non-native
> experts (or only those who are both native and experts)? – and how we might
> be able to test the validity of annotation grids (for comparisons across
> time and/or languages), there won’t be convincing arguments for how to deal
> with diachronic data (or data from lesser-studied languages for which there
> might be no native speakers available) in empirical studies that are to
> disclose more fine-grained distinctions and changes, also in order to
> quantify them. In particular, reviewers of project applications may always
> ask for a convincing methodology, and if no such research is funded we’ll
> remain ignorant of many of the reasons for and backgrounds of language
> change.
>
>
>
> I’d appreciate advice, in particular if it provides answers to any of the
> questions under 1-6 above.
>
>
>
> Best,
>
> Björn (Wiemer).
>
>
>
>
>
> *Von:* Lingtyp <lingtyp-bounces at listserv.linguistlist.org>
> <lingtyp-bounces at listserv.linguistlist.org> *Im Auftrag von *William Croft
> *Gesendet:* Montag, 16. Oktober 2023 15:52
> *An:* Volker Gast <volker.gast at uni-jena.de> <volker.gast at uni-jena.de>
> *Cc:* LINGTYP at LISTSERV.LINGUISTLIST.ORG
> *Betreff:* Re: [Lingtyp] typology projects that use inter-rater
> reliability?
>
>
>
> An early cross-linguistic study with multiple annotators is this one:
>
>
>
> Gundel, Jeannette K., Nancy Hedberg & Ron Zacharski. 1993. Cognitive
> status and the form of referring expressions in discourse. *Language*
>  69.274-307.
>
>
>
> It doesn’t have all the documentation that Volker suggests; our standards
> for providing documentation has risen.
>
>
>
> I have been involved in annotation projects in natural language
> processing, where the aim is to annotate corpora so that automated methods
> can “learn” the annotation categories from the “gold standard” (i.e.
> “expert”) annotation -- this is supervised learning in NLP. Recent efforts
> are aiming at developing a single annotation scheme for use across
> languages, such as Universal Dependencies (for syntactic annotation),
> Uniform Meaning Representation (for semantic annotation), and Unimorph (for
> morphological annotation). My experience is somewhat similar to Volker’s:
> even when the annotation scheme is very coarse-grained (from a theoretical
> linguist’s point of view), getting good enough interannotator agreement is
> hard, even when the annotators are the ones who designed the scheme, or are
> native speakers or have done fieldwork on the language. I would add to
> Volker’s comments that one has to be trained for annotation; but that
> training can introduce (mostly implicit) biases, at least in the eyes of
> proponents of a different theoretical approach -- something that is more
> apparent in a field such as linguistics where there are large differences
> in theoretical approaches.
>
>
>
> Bill
>
>
>
> On Oct 16, 2023, at 6:02 AM, Volker Gast <volker.gast at uni-jena.de> wrote:
>
>
>
>
> Hey Adam (and others),
>
> I think you could phrase the question differently: What typological
> studies have been carried out with multiple annotators and careful
> documentation of the annotation process, including precise annotation
> guidelines, the training of the annotators, publication of all the
> (individual) annotations, calculation of inter-annotator agreement etc.?
>
> I think there are very few. The reason is that the process is very
> time-consuming, and "risky". I was a member of a project co-directed with
> Vahram Atayan (Heidelberg) where we carried out very careful annotations
> dealing with what we call 'adverbials of immediate posteriority' (see the
> references below). Even though we only dealt with a few well-known European
> languages, it took us quite some time to develop annotation guidelines and
> train annotators. The inter-rater agreement was surprisingly low even for
> categories that appeared straightforward to us, e.g. agentivity of a
> predicate; and we were dealing with well-known languages (English, German,
> French, Spanish, Italian). So the outcomes of this process were very
> moderate in comparison with the work that went into the annotations. (Note
> that the project was primarily situated in the field of contrastive
> linguistics and translation studies, not linguistic typology, but the
> challenges are the same).
>
> It's a dilemma: as a field, we often fail to meet even the most basic
> methodological requirements that are standardly made in other fields (most
> notably psychology). I know of at least two typological projects where
> inter-rater agreement tests were run, but the results were so poor that a
> decision was made to not pursue this any further (meaning, the projects
> were continued, but without inter-annotator agreement tests; that's what
> makes annotation projects "risky": what do you do if you never reach a
> satisfactory level of inter-annotator agreement?). Most annotation
> projects, including some of my own earlier work, are based on what we
> euphemistically call 'expert annotation', with 'expert' referring to
> ourselves, the authors. Today I would minimally expect the annotations to
> be done by someone who is not an author, and I try to implement that
> requirement in my role as a journal editor (Linguistics), but it's hard. We
> do want to see more empirical work published, and if the methodological
> standards are too high, we will end up publishing nothing at all.
>
> I'd be very happy if there were community standards for this, and I'd like
> to hear about any initiatives implementing more rigorous methodological
> standards in linguistic typology. Honestly, I wouldn't know what to
> require. But it seems clear to me that we cannot simply go on like this,
> annotating our own data, which we subsequently analyze, as it is well known
> that annotation decisions are influenced by (mostly implicit) biases.
>
> Best,
> Volker
>
> Gast, Volker & Vahram Atayan (2019). 'Adverbials of immediate posteriority
> in French and German: A contrastive corpus study of tout de suite,
> immédiatement, gleich and sofort'. In Emonds, J., M. Janebová & L.
> Veselovská (eds.): Language Use and Linguistic Structure. Proceedings of
> the Olomouc Linguistics Colloquium 2018, 403-430. Olomouc Modern Language
> Series. Olomouc: Palacký University Olomouc.
>
> in German:
>
> Atayan, V., B. Fetzer, V. Gast, D. Möller, T. Ronalter (2019).
> 'Ausdrucksformen der unmittelbaren Nachzeitigkeit in Originalen und
> Übersetzungen: Eine Pilotstudie zu den deutschen Adverbien gleich und
> sofort und ihren Äquivalenten im Französischen, Italienischen, Spanischen
> und Englischen'. In Ahrens, B., S. Hansen-Schirra, M. Krein-Kühle, M.
> Schreiber, U. Wienen (eds.): Translation -- Linguistik -- Semiotik, 11-82.
> Berlin: Frank & Timme.
>
> Gast, V., V. Atayan, J. Biege, B. Fetzer, S. Hettrich, A. Weber (2019).
> 'Unmittelbare Nachzeitigkeit im Deutschen und Französischen: Eine Studie
> auf Grundlage des OpenSubtitles-Korpus'. In Konecny, C., C. Konzett, E.
> Lavric, W. Pöckl (eds.): Comparatio delectat III. Akten der VIII.
> Internationalen Arbeitstagung zum romanisch-deutschen und innerromanischen
> Sprachvergleich, 223-249. Frankfurt: Lang.
>
>
> ---
> Prof. V. Gast
> https://linktype.iaa.uni-jena.de/VG
>
> On Sat, 14 Oct 2023, Adam James Ross Tallman wrote:
>
>
> Hello all,
> I am gathering a list of projects / citations / papers that use or refer
> to inter-rater reliability. So far I have:
> Himmelmann et al. 2018. On the universality of intonational phrases: a
> cross-linguistic interrater study. Phonology 35.
> Gast & Koptjevskaja-Tamm. 2022. Patterns of persistence and diffusibility
> in the European lexicon. Linguistic Typology (not explicitly the topic of
> the paper, but interrater reliability metrics are used)
> I understand people working with Grambank have used it, but I don't know
> if there is a publication on that.
> best,
> Adam
> --
> Adam J.R. Tallman
> Post-doctoral Researcher
> Friedrich Schiller Universität
> Department of English Studies
>
> _______________________________________________
> Lingtyp mailing list
> Lingtyp at listserv.linguistlist.org
> https://listserv.linguistlist.org/cgi-bin/mailman/listinfo/lingtyp
>
>
>
> _______________________________________________
> Lingtyp mailing list
> Lingtyp at listserv.linguistlist.org
> https://listserv.linguistlist.org/cgi-bin/mailman/listinfo/lingtyp
>
> --
> Martin Haspelmath
> Max Planck Institute for Evolutionary Anthropology
> Deutscher Platz 6
> D-04103 Leipzig
> https://www.eva.mpg.de/linguistic-and-cultural-evolution/staff/martin-haspelmath/
>
> _______________________________________________
> Lingtyp mailing list
> Lingtyp at listserv.linguistlist.org
> https://listserv.linguistlist.org/cgi-bin/mailman/listinfo/lingtyp
>

