[Lingtyp] typology projects that use inter-rater reliability?

Mon Oct 16 13:52:00 UTC 2023

An early cross-linguistic study with multiple annotators is this one:

Gundel, Jeannette K., Nancy Hedberg & Ron Zacharski. 1993. Cognitive status and the form of referring expressions in discourse. Language 69.274-307.

It doesn’t have all the documentation that Volker suggests; our standards for providing documentation have risen.

I have been involved in annotation projects in natural language processing, where the aim is to annotate corpora so that automated methods can “learn” the annotation categories from the “gold standard” (i.e. “expert”) annotation -- this is supervised learning in NLP. Recent efforts aim to develop a single annotation scheme for use across languages, such as Universal Dependencies (for syntactic annotation), Uniform Meaning Representation (for semantic annotation), and UniMorph (for morphological annotation). My experience is somewhat similar to Volker’s: even when the annotation scheme is very coarse-grained (from a theoretical linguist’s point of view), getting good enough interannotator agreement is hard, even when the annotators are the ones who designed the scheme, or are native speakers or have done fieldwork on the language. I would add to Volker’s comments that one has to be trained for annotation; but that training can introduce (mostly implicit) biases, at least in the eyes of proponents of a different theoretical approach -- something that is more apparent in a field such as linguistics where there are large differences in theoretical approaches.
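
For readers who have not computed interannotator agreement before, here is a minimal sketch of one common chance-corrected measure, Cohen's kappa, for two annotators. It is only an illustration: the function name, the label names, and the toy data are hypothetical and are not taken from any of the projects mentioned in this thread, and other measures (e.g. Krippendorff's alpha) may suit a given project better.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equally long, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: share of items on which both labels are identical.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one and the same label throughout
    return (p_o - p_e) / (1 - p_e)


# Hypothetical toy data: agentivity judgments on ten predicates.
annotator_1 = ["ag"] * 6 + ["non"] * 4
annotator_2 = ["ag"] * 4 + ["non"] * 5 + ["ag"]

# Raw agreement is 7/10, chance agreement is 0.5, so kappa is about 0.4.
print(round(cohens_kappa(annotator_1, annotator_2), 2))
```

The toy case shows why raw percentages flatter the result: the two annotators agree on 70% of the items, but once chance agreement is subtracted the score is only about 0.4.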

Bill

> On Oct 16, 2023, at 6:02 AM, Volker Gast <volker.gast at uni-jena.de> wrote:
> 
> 
> Hey Adam (and others),
> 
> I think you could phrase the question differently: What typological studies have been carried out with multiple annotators and careful documentation of the annotation process, including precise annotation guidelines, the training of the annotators, publication of all the (individual) annotations, calculation of inter-annotator agreement etc.?
> 
> I think there are very few. The reason is that the process is very time-consuming, and "risky". I was a member of a project co-directed with Vahram Atayan (Heidelberg) where we carried out very careful annotations dealing with what we call 'adverbials of immediate posteriority' (see the references below). Even though we only dealt with a few well-known European languages, it took us quite some time to develop annotation guidelines and train annotators. The inter-rater agreement was surprisingly low even for categories that appeared straightforward to us, e.g. agentivity of a predicate; and we were dealing with well-known languages (English, German, French, Spanish, Italian). So the outcomes of this process were very moderate in comparison with the work that went into the annotations. (Note that the project was primarily situated in the field of contrastive linguistics and translation studies, not linguistic typology, but the challenges are the same).
> 
> It's a dilemma: as a field, we often fail to meet even the most basic methodological requirements that are standardly made in other fields (most notably psychology). I know of at least two typological projects where inter-rater agreement tests were run, but the results were so poor that a decision was made to not pursue this any further (meaning, the projects were continued, but without inter-annotator agreement tests; that's what makes annotation projects "risky": what do you do if you never reach a satisfactory level of inter-annotator agreement?). Most annotation projects, including some of my own earlier work, are based on what we euphemistically call 'expert annotation', with 'expert' referring to ourselves, the authors. Today I would minimally expect the annotations to be done by someone who is not an author, and I try to implement that requirement in my role as a journal editor (Linguistics), but it's hard. We do want to see more empirical work published, and if the methodological standards are too high, we will end up publishing nothing at all.
> 
> I'd be very happy if there were community standards for this, and I'd like to hear about any initiatives implementing more rigorous methodological standards in linguistic typology. Honestly, I wouldn't know what to require. But it seems clear to me that we cannot simply go on like this, annotating our own data, which we subsequently analyze, as it is well known that annotation decisions are influenced by (mostly implicit) biases.
> 
> Best,
> Volker
> 
> Gast, Volker & Vahram Atayan (2019). 'Adverbials of immediate posteriority in French and German: A contrastive corpus study of tout de suite, immédiatement, gleich and sofort'. In Emonds, J., M. Janebová & L. Veselovská (eds.): Language Use and Linguistic Structure. Proceedings of the Olomouc Linguistics Colloquium 2018, 403-430. Olomouc Modern Language Series. Olomouc: Palacký University Olomouc.
> 
> in German:
> 
> Atayan, V., B. Fetzer, V. Gast, D. Möller, T. Ronalter (2019). 'Ausdrucksformen der unmittelbaren Nachzeitigkeit in Originalen und Übersetzungen: Eine Pilotstudie zu den deutschen Adverbien gleich und sofort und ihren Äquivalenten im Französischen, Italienischen, Spanischen und Englischen'. In Ahrens, B., S. Hansen-Schirra, M. Krein-Kühle, M. Schreiber, U. Wienen (eds.): Translation -- Linguistik -- Semiotik, 11-82. Berlin: Frank & Timme.
> 
> Gast, V., V. Atayan, J. Biege, B. Fetzer, S. Hettrich, A. Weber (2019). 'Unmittelbare Nachzeitigkeit im Deutschen und Französischen: Eine Studie auf Grundlage des OpenSubtitles-Korpus'. In Konecny, C., C. Konzett, E. Lavric, W. Pöckl (eds.): Comparatio delectat III. Akten der VIII. Internationalen Arbeitstagung zum romanisch-deutschen und innerromanischen Sprachvergleich, 223-249. Frankfurt: Lang.
> 
> 
> ---
> Prof. V. Gast
> https://linktype.iaa.uni-jena.de/VG
> 
> On Sat, 14 Oct 2023, Adam James Ross Tallman wrote:
> 
>> Hello all,
>> I am gathering a list of projects / citations / papers that use or refer to inter-rater reliability. So far I have:
>> Himmelmann et al. On the universality of intonational phrases: a cross-linguistic interrater study. Phonology 35.
>> Gast & Koptjevskaja-Tamm. 2022. Patterns of persistence and diffusibility in the European lexicon. Linguistic Typology (not explicitly the topic of the paper, but interrater reliability metrics are used)
>> I understand people working with Grambank have used it, but I don't know if there is a publication on that.
>> best,
>> Adam
>> --
>> Adam J.R. Tallman
>> Post-doctoral Researcher
>> Friedrich Schiller Universität
>> Department of English Studies
> _______________________________________________
> Lingtyp mailing list
> Lingtyp at listserv.linguistlist.org
> https://listserv.linguistlist.org/cgi-bin/mailman/listinfo/lingtyp
