[Lingtyp] Areal and phylogenetic *researcher* biases
Martin Haspelmath
martin_haspelmath at eva.mpg.de
Mon Sep 30 13:11:24 UTC 2024
Of course, "areal/phylogenetic researcher bias (APRB)" exists, and
during the Grambank coding, I often heard Hedvig Skirgård talk about it
as a potential issue. (I don't remember if it was addressed in a
specific way, though.)
I don't know if it can be measured somehow (given the enormous diversity
of researcher traditions, I'm a bit skeptical), but I think it can be
mitigated if we are aware that the purpose of comparative concepts in
typology is NOT to provide *analyses* – rather, it is to enable us to
*classify* languages.
Volker Gast rightly says: "Two linguists working on the same language
will often provide very different analyses, and both may be right in
their own ways."
But while the *analyses* may well be different (because of the
well-known non-uniqueness problem first highlighted by Yuen-Ren Chao in
1934: https://dlc.hypotheses.org/3381), the *classifications* should not
be different if the different linguists have access to the same information.
I wrote about this in the following blogpost, where I note that the
"difficulties of classification" that typologists talk about are
typically due to the unclarity of the comparative concepts, not
necessarily to lack of data: https://dlc.hypotheses.org/2528.
In practice, of course, different linguists do not have access to the
same kinds of data, and subjectivity cannot be excluded entirely.
However, if we are careful to distinguish between analyses/descriptions
(at the p-level) and classifications and cross-linguistic
generalizations (at the g-level), some problems will go away.
Best,
Martin
On 29.09.24 12:41, Volker Gast via Lingtyp wrote:
>
> Dear Jürgen and others,
>
> I think this is one of the major methodological problems of linguistic
> typology (which, if I remember correctly, has been discussed on this
> list before). There's no single 'correct' way of analysing a language.
> Two linguists working on the same language will often provide very
> different analyses, and both may be right in their own ways. It starts
> with phonology, where you have a lot of degrees of freedom in, for
> instance, minimizing or maximizing phoneme inventories (e.g. by [not]
> introducing phonological domains and features operating on these
> domains), and it gets worse in morphology, specifically if there is
> distributed exponence and other complexities of this type. At the
> level of syntax the impact of the specific theoretical background can
> be seen, for instance, in publications using the UD corpora. These
> corpora were annotated with a specific version of dependency grammar,
> I think essentially for pragmatic reasons (dependency grammar was very
> popular among computational linguists for a while). The theoretical
> assumptions of the annotation model obviously have an impact on the
> results (just think of the very old discussion of what a 'subject' is,
> represented as the 'nsubj' relation in the UD annotations).
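>
> To make this concrete: any downstream count of "subjects" extracted
> from a UD treebank simply inherits whatever the annotation guidelines
> decided to label 'nsubj'. A rough sketch of such a count (the file
> name is of course made up):
>
> from collections import Counter
>
> def count_deprels(conllu_path):
>     """Count dependency relation labels in a CoNLL-U treebank."""
>     counts = Counter()
>     with open(conllu_path, encoding="utf-8") as f:
>         for line in f:
>             line = line.strip()
>             if not line or line.startswith("#"):
>                 continue                    # skip blanks and sentence metadata
>             cols = line.split("\t")
>             if len(cols) == 10 and cols[0].isdigit():  # skip multiword-token ranges
>                 counts[cols[7]] += 1                   # DEPREL is the 8th column
>     return counts
>
> # counts = count_deprels("some_treebank.conllu")   # made-up file name
> # print(counts["nsubj"], "nsubj relations out of", sum(counts.values()))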
>
> For many languages we only have one description, and the linguist
> describing it comes from a specific background or 'school' (and these
> schools are often associated with particular areas and particular
> phylogenetic groupings, introducing further biases of the type you
> mention). Again, the effects are visible at the level of phonology
> already. For example, the Papuan language Idi could be described as
> having just three vowels, or as having nine vowels (perhaps even
> more), depending on your assumptions about phonotactics etc. (There's
> a published analysis of that language, by D. Schokkin, N. Evans, C.
> Döhler and me, but the analysis really reflects some kind of
> compromise between the authors, and it leaves a few non-trivial
> questions open.)
>
> The specific linguist and their school or background are a source of
> statistical non-independence. And even when coders rely on exactly one
> description per language, having the data coded by several researchers
> often leads to low inter-annotator agreement, in my experience.
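>
> To make "low inter-annotator agreement" concrete, here is a rough
> sketch of how it can be quantified with Cohen's kappa; the feature
> values below are invented purely for illustration:
>
> from collections import Counter
>
> def cohens_kappa(coder_a, coder_b):
>     """Cohen's kappa for two coders' categorical codes of the same items."""
>     n = len(coder_a)
>     observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
>     freq_a, freq_b = Counter(coder_a), Counter(coder_b)
>     chance = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
>     return (observed - chance) / (1 - chance)
>
> # two coders classifying ten (invented) languages from the same grammars:
> a = ["tensed", "tensed", "tenseless", "tensed", "tenseless",
>      "tensed", "tenseless", "tensed", "tensed", "tenseless"]
> b = ["tensed", "tenseless", "tenseless", "tensed", "tensed",
>      "tensed", "tenseless", "tensed", "tenseless", "tenseless"]
> print(round(cohens_kappa(a, b), 2))   # 0.4 -- only moderate agreement
>                                       # despite identical source material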
>
> I think we need to be aware that typological data is behavioural data
> at three layers: (i) language is a behavioural activity, (ii)
> describing a language is a behavioural activity, and (iii) extracting
> information from descriptions is another behavioural activity.
> Variance occurs at all levels and is multiplied in the process from
> (i) to (iii).
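>
> A toy simulation (all numbers invented) makes the accumulation
> visible: if each layer contributes its own independent noise to a
> "true" value, the layer variances add up in the final datapoint.
>
> import random
> random.seed(1)
>
> def simulate(true_value=0.0, n=100000,
>              speaker_sd=0.2, describer_sd=0.3, coder_sd=0.4):
>     """Each datapoint = true value + independent noise from layers (i)-(iii)."""
>     vals = [true_value
>             + random.gauss(0, speaker_sd)      # (i)   speakers vary
>             + random.gauss(0, describer_sd)    # (ii)  the description distorts
>             + random.gauss(0, coder_sd)        # (iii) extraction/coding distorts again
>             for _ in range(n)]
>     mean = sum(vals) / n
>     return sum((v - mean) ** 2 for v in vals) / n
>
> print(simulate())   # roughly 0.2**2 + 0.3**2 + 0.4**2 = 0.29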
>
> Approximately determining the amount of variance of that type would be
> a major project. For instance, we could have five undocumented
> (unstandardized) languages described by five linguists each, using
> data from five different speakers per language. Many will think that
> this would be a waste of resources, given the number of (varieties of)
> languages that still await description.
>
> What follows from all this, in my view, is that we need to be careful
> in applying statistical analyses "blindly". Linguistics is not a
> natural science. Given the large amount of inherent variance in
> typological data, we linguists should remain in the driver's seat and
> use quantitative typological evidence as an assistance system, being
> aware of its limits and possibilities, rather than take a back seat
> and let the autopilot drive.
>
> Best,
> Volker (Gast)
>
>
> On 28.09.2024 at 20:17, Juergen Bohnemeyer via Lingtyp wrote:
>>
>> Dear all – I’m wondering whether anybody has attempted to estimate
>> the size of the following putative effect on descriptive and
>> typological research:
>>
>> Suppose there is a particular phenomenon in Language L, the known
>> properties of which are equally compatible with an analysis in terms
>> of construction types (comparative concepts) A and B.
>>
>> Suppose furthermore that L belongs to a language family and/or
>> linguistic area such that A has much more commonly been invoked in
>> descriptions of languages of that family/area than B.
>>
>> Then to the extent that a researcher attempting to adjudicate between
>> A and B wrt. L (whether in a description of L, in a typological
>> study, or in coding for an evolving typological database) is aware of
>> the prevalence of A-coding/analyses for languages of the family/area
>> in question, that might make them more likely to code/analyze L as
>> exhibiting A as well.
>>
>> So for example, a researcher who assumes languages of the family/area
>> of L to be typically tenseless may be influenced by this assumption
>> and as a result become (however slightly) more likely to treat L as
>> tenseless as well. In contrast, if she assumes languages of the
>> family/area of L to be typically tensed, that might make her ever so
>> slightly more likely to analyze L also as tensed.
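>>
>> A back-of-the-envelope illustration (all probabilities invented): if
>> the language-internal evidence is genuinely ambiguous, a coder who
>> updates rationally on it will simply end up mirroring their prior
>> expectation about the family/area.
>>
>> def posterior_tenseless(prior_tenseless, likelihood_ratio=1.0):
>>     """P(tenseless | evidence) by Bayes' rule; a likelihood ratio of 1.0
>>     means the evidence itself favours neither analysis."""
>>     odds = prior_tenseless / (1 - prior_tenseless) * likelihood_ratio
>>     return odds / (1 + odds)
>>
>> # same ambiguous evidence, different areal/phylogenetic expectations:
>> print(round(posterior_tenseless(0.5), 3))  # neutral prior                        -> 0.5
>> print(round(posterior_tenseless(0.8), 3))  # "this family is typically tenseless" -> 0.8
>> print(round(posterior_tenseless(0.2), 3))  # "this family is typically tensed"    -> 0.2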
>>
>> It seems to me that this is a cognitive bias related to, and possibly
>> a case of, essentialism. (And just as in the case of (other forms of)
>> essentialism, the actual cognitive causes/mechanisms of the bias may
>> vary.)
>>
>> But regardless, my question is, again, has anybody tried to
>> guesstimate to what extent the results of current typological studies
>> may be warped by this kind of researcher bias? (Note that the bias
>> may be affecting both authors of descriptive work and typologists
>> using descriptive work as data, so there is a possible double-whammy
>> effect.)
>>
>> Thanks! – Juergen
>>
>> Juergen Bohnemeyer (He/Him)
>> Professor, Department of Linguistics
>> University at Buffalo
>>
--
Martin Haspelmath
Max Planck Institute for Evolutionary Anthropology
Deutscher Platz 6
D-04103 Leipzig
https://www.eva.mpg.de/linguistic-and-cultural-evolution/staff/martin-haspelmath/