[Lingtyp] Areal and phylogenetic *researcher* biases
Martin Haspelmath
martin_haspelmath at eva.mpg.de
Mon Sep 30 13:11:24 UTC 2024
Of course, "areal/phylogenetic researcher bias (APRB)" exists, and
during the Grambank coding, I often heard Hedvig Skirgård talk about it
as a potential issue. (I don't remember if it was addressed in a
specific way, though.)
I don't know if it can be measured somehow (given the enormous diversity
of researcher traditions, I'm a bit skeptical), but I think it can be
mitigated if we are aware that the purpose of comparative concepts in
typology is NOT to provide *analyses* – rather, it is to enable us to
*classify* languages.
Volker Gast rightly says: "Two linguists working on the same language
will often provide very different analyses, and both may be right in
their own ways."
But while the *analyses* may well be different (because of the
well-known non-uniqueness problem first highlighted by Yuen-Ren Chao in
1934: https://dlc.hypotheses.org/3381), the *classifications* should not
be different if the different linguists have access to the same information.
I wrote about this in the following blogpost, where I note that the
"difficulties of classification" that typologists talk about are
typically due to the unclarity of the comparative concepts, not
necessarily to lack of data: https://dlc.hypotheses.org/2528.
In practice, of course, different linguists do not have access to the
same kinds of data, and subjectivity cannot be excluded entirely.
However, if we are careful to distinguish between analyses/descriptions
(at the p-level) and classifications and cross-linguistic
generalizations (at the g-level), some problems will go away.
Best,
Martin
On 29.09.24 12:41, Volker Gast via Lingtyp wrote:
>
> Dear Jürgen and others,
>
> I think this is one of the major methodological problems of linguistic
> typology (which, if I remember correctly, has been discussed on this
> list before). There's no single 'correct' way of analysing a language.
> Two linguists working on the same language will often provide very
> different analyses, and both may be right in their own ways. It starts
> with phonology, where you have a lot of degrees of freedom in, for
> instance, minimizing or maximizing phoneme inventories (e.g. by [not]
> introducing phonological domains and features operating on these
> domains), and it gets worse in morphology, specifically if there is
> distributed exponence and other complexities of this type. At the
> level of syntax the impact of the specific theoretical background can
> be seen, for instance, in publications using the UD corpora. These
> corpora were annotated with a specific version of dependency grammar,
> I think essentially for pragmatic reasons (dependency grammar was very
> popular among computational linguists for a while). The theoretical
> assumptions of the annotation model obviously have an impact on the
> results (just think of the very old discussion of what a 'subject' is,
> represented as the 'nsubj' relation in the UD annotations).
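>
> To make this concrete: any downstream count of "subjects" extracted
> from a UD treebank simply inherits whatever the annotation guidelines
> decided to label 'nsubj'. A rough sketch of such a count (the file
> name is of course made up):
>
> from collections import Counter
>
> def count_deprels(conllu_path):
>     """Count dependency relation labels in a CoNLL-U treebank."""
>     counts = Counter()
>     with open(conllu_path, encoding="utf-8") as f:
>         for line in f:
>             line = line.strip()
>             if not line or line.startswith("#"):
>                 continue                    # skip blanks and sentence metadata
>             cols = line.split("\t")
>             if len(cols) == 10 and cols[0].isdigit():  # skip multiword-token ranges
>                 counts[cols[7]] += 1                   # DEPREL is the 8th column
>     return counts
>
> # counts = count_deprels("some_treebank.conllu")   # made-up file name
> # print(counts["nsubj"], "nsubj relations out of", sum(counts.values()))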
>
> For many languages we only have one description, and the linguist
> describing it comes from a specific background or 'school' (and these
> schools are often associated with particular areas and particular
> phylogenetic groupings, introducing further biases of the type you
> mention). Again, the effects are visible at the level of phonology
> already. For example, the Papuan language Idi could be described as
> having just three vowels, or as having nine vowels (perhaps even
> more), depending on your assumptions about phonotactics etc. (There's
> a published analysis of that language, by D. Schokkin, N. Evans, C.
> Döhler and me, but the analysis really reflects some kind of
> compromise between the authors, and it leaves a few non-trivial
> questions open.)
>
> The specific linguist and their school or background are a source of
> statistical non-independence. And even when coders rely on exactly one
> description per language, having the data coded by several researchers
> often leads to low inter-annotator agreement, in my experience.
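>
> To make "low inter-annotator agreement" concrete, here is a rough
> sketch of how it can be quantified with Cohen's kappa; the feature
> values below are invented purely for illustration:
>
> from collections import Counter
>
> def cohens_kappa(coder_a, coder_b):
>     """Cohen's kappa for two coders' categorical codes of the same items."""
>     n = len(coder_a)
>     observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
>     freq_a, freq_b = Counter(coder_a), Counter(coder_b)
>     chance = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
>     return (observed - chance) / (1 - chance)
>
> # two coders classifying ten (invented) languages from the same grammars:
> a = ["tensed", "tensed", "tenseless", "tensed", "tenseless",
>      "tensed", "tenseless", "tensed", "tensed", "tenseless"]
> b = ["tensed", "tenseless", "tenseless", "tensed", "tensed",
>      "tensed", "tenseless", "tensed", "tenseless", "tenseless"]
> print(round(cohens_kappa(a, b), 2))   # 0.4 -- only moderate agreement
>                                       # despite identical source material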
>
> I think we need to be aware that typological data is behavioural data
> at three layers: (i) language is a behavioural activity, (ii)
> describing a language is a behavioural activity, and (iii) extracting
> information from descriptions is another behavioural activity.
> Variance occurs at all levels and is multiplied in the process from
> (i) to (iii).
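>
> A toy simulation (all numbers invented) makes the accumulation
> visible: if each layer contributes its own independent noise to a
> "true" value, the layer variances add up in the final datapoint.
>
> import random
> random.seed(1)
>
> def simulate(true_value=0.0, n=100000,
>              speaker_sd=0.2, describer_sd=0.3, coder_sd=0.4):
>     """Each datapoint = true value + independent noise from layers (i)-(iii)."""
>     vals = [true_value
>             + random.gauss(0, speaker_sd)      # (i)   speakers vary
>             + random.gauss(0, describer_sd)    # (ii)  the description distorts
>             + random.gauss(0, coder_sd)        # (iii) extraction/coding distorts again
>             for _ in range(n)]
>     mean = sum(vals) / n
>     return sum((v - mean) ** 2 for v in vals) / n
>
> print(simulate())   # roughly 0.2**2 + 0.3**2 + 0.4**2 = 0.29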
>
> Approximately determining the amount of variance of that type would be
> a major project. For instance, we could have five undocumented
> (unstandardized) languages described by five linguists each, using
> data from five different speakers per language. Many will think that
> this would be a waste of resources, given the number of (varieties of)
> languages that still await description.
>
> What follows from all this, in my view, is that we need to be careful
> in applying statistical analyses "blindly". Linguistics is not a
> natural science. Given the large amount of inherent variance in
> typological data, we linguists should remain in the driver's seat and
> use quantitative typological evidence as an assistance system, being
> aware of its limits and possibilities, rather than take a back seat
> and let the autopilot drive.
>
> Best,
> Volker (Gast)
>
>
> On 28.09.2024 at 20:17, Juergen Bohnemeyer via Lingtyp wrote:
>>
>> Dear all – I’m wondering whether anybody has attempted to estimate
>> the size of the following putative effect on descriptive and
>> typological research:
>>
>> Suppose there is a particular phenomenon in Language L, the known
>> properties of which are equally compatible with an analysis in terms
>> of construction types (comparative concepts) A and B.
>>
>> Suppose furthermore that L belongs to a language family and/or
>> linguistic area such that A has much more commonly been invoked in
>> descriptions of languages of that family/area than B.
>>
>> Then to the extent that a researcher attempting to adjudicate between
>> A and B wrt. L (whether in a description of L, in a typological
>> study, or in coding for an evolving typological database) is aware of
>> the prevalence of A-coding/analyses for languages of the family/area
>> in question, that might make them more likely to code/analyze L as
>> exhibiting A as well.
>>
>> So for example, a researcher who assumes languages of the family/area
>> of L to be typically tenseless may be influenced by this assumption
>> and as a result become (however slightly) more likely to treat L as
>> tenseless as well. In contrast, if she assumes languages of the
>> family/area of L to be typically tensed, that might make her ever so
>> slightly more likely to analyze L also as tensed.
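>>
>> A back-of-the-envelope illustration (all probabilities invented): if
>> the language-internal evidence is genuinely ambiguous, a coder who
>> updates rationally on it will simply end up mirroring their prior
>> expectation about the family/area.
>>
>> def posterior_tenseless(prior_tenseless, likelihood_ratio=1.0):
>>     """P(tenseless | evidence) by Bayes' rule; a likelihood ratio of 1.0
>>     means the evidence itself favours neither analysis."""
>>     odds = prior_tenseless / (1 - prior_tenseless) * likelihood_ratio
>>     return odds / (1 + odds)
>>
>> # same ambiguous evidence, different areal/phylogenetic expectations:
>> print(round(posterior_tenseless(0.5), 3))  # neutral prior                        -> 0.5
>> print(round(posterior_tenseless(0.8), 3))  # "this family is typically tenseless" -> 0.8
>> print(round(posterior_tenseless(0.2), 3))  # "this family is typically tensed"    -> 0.2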
>>
>> It seems to me that this is a cognitive bias related to, and possibly
>> a case of, essentialism. (And just as in the case of (other forms of)
>> essentialism, the actual cognitive causes/mechanisms of the bias may
>> vary.)
>>
>> But regardless, my question is, again, has anybody tried to
>> guesstimate to what extent the results of current typological studies
>> may be warped by this kind of researcher bias? (Note that the bias
>> may be affecting both authors of descriptive work and typologists
>> using descriptive work as data, so there is a possible double-whammy
>> effect.)
>>
>> Thanks! – Juergen
>>
>> Juergen Bohnemeyer (He/Him)
>> Professor, Department of Linguistics
>> University at Buffalo
>>
--
Martin Haspelmath
Max Planck Institute for Evolutionary Anthropology
Deutscher Platz 6
D-04103 Leipzig
https://www.eva.mpg.de/linguistic-and-cultural-evolution/staff/martin-haspelmath/