[Lingtyp] Greenbergian word order universals: confirmed after all

Tue Nov 7 15:54:34 UTC 2023

Dear Martin,

> As you say, the reliability of these studies hinges on the cognate coding,
> which is done manually, by humans with their biases. I'm wondering if there
> is a way to measure the degree to which different linguists agree or not
> (by some kind of kappa statistic), and a way to identify or exclude
> systematic biases (which are part of normal human behaviour).
>
This has not been done systematically in the past, but a measure of
inter-annotator agreement should indeed be computed in future research on
phylogenies (at least when large teams are involved).
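
As a rough illustration of the kind of kappa statistic mentioned above, here
is a minimal sketch of Cohen's kappa for two annotators. It assumes that the
cognate coding has been reduced to one yes/no judgment per word pair; the
function name and the toy judgments are hypothetical, not taken from any
actual dataset or study.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both annotators.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each annotator's own label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical judgments: 1 = "these two words are cognate", 0 = "not cognate".
annotator_1 = [1, 1, 0, 0, 1, 0, 1, 1, 0, 0]
annotator_2 = [1, 1, 0, 1, 1, 0, 0, 1, 0, 0]
kappa = cohen_kappa(annotator_1, annotator_2)
print(round(kappa, 3))
```

Real cognate coding assigns words to cognate sets rather than to yes/no
classes, and teams usually have more than two coders, so a multi-annotator
measure may be more appropriate in practice.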

> Another thing that I worry about is that grammatical markers (even
> demonstratives and interrogatives) are ignored (see the list of 170
> comparison meanings in IE-COR: https://iecor.clld.org/parameters), even
> though we know that these are the most resistant to borrowing.
>
Demonstratives and interrogatives are part of Swadesh's original 200-word
list, and some studies include them (for instance our 2019 Sino-Tibetan
phylogeny <https://www.pnas.org/doi/10.1073/pnas.1817972116>). In most
families, it is actually surprisingly difficult to deal with these forms.
The problem is that the phylogenetic methods that we use require the etyma
in the various lexical meanings to be independent from each other: this is
the reason why the word lists must be tailored to each family, not to
maximize the number of cognates, but to avoid colexifications within the
same list (between "son" and "boy", or between "sun" and "day", etc.). In the
case of interrogatives, in a language where all interrogative pronouns
share a common root (like wh- words in English), you really don't want to
include more than one of them in your list.
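
The independence requirement can be sketched as a simple filter over a word
list. This is only an illustration under stated assumptions: the data
structure, the function name, the greedy first-come order and the language
"LangX" with its roots are all invented for the example, and real list
design relies on etymological judgment rather than on an automatic rule.

```python
def prune_colexified(wordlist):
    """Keep a meaning only if it shares no root, in any language,
    with a meaning that has already been kept (greedy, in list order)."""
    kept = []
    used = set()  # (language, root) pairs claimed by the kept meanings
    for meaning, forms in wordlist.items():
        pairs = {(language, root) for language, root in forms.items()}
        if pairs & used:
            # Some language expresses this meaning with an already used root.
            continue
        kept.append(meaning)
        used |= pairs
    return kept


# Hypothetical data: meaning -> {language: root identifier}.
wordlist = {
    "who": {"English": "wh-", "LangX": "ka"},
    "what": {"English": "wh-", "LangX": "mi"},
    "sun": {"English": "sun", "LangX": "ni"},
    "day": {"English": "day", "LangX": "ni"},
}
kept = prune_colexified(wordlist)
print(kept)
```

Here "what" is dropped because English builds it on the same root as "who",
and "day" is dropped because the invented LangX colexifies it with "sun".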

> Especially in closely related languages, it's very hard to distinguish
> lexical loanwords from inherited words, isn't it? (For example, Dutch
> begrijpen 'understand' is said to have been borrowed from German
> https://wold.clld.org/word/72181920155924122, but without the rich
> attestation of both languages since the Middle Ages, we wouldn't be able to
> tell.)
>
Phylogenetic methods don't work miracles: your phylogeny is only going to
be as reliable as your knowledge of the history of each language. If we
want better phylogenies, we first and foremost need more
fieldwork/philological research, more work on etymologies in all language
families, and more specialists in this field. This is why historical
linguists should all embrace these methods ... 😀

> "But then what does this mean when you take one language from a family like
> Austronesian with ~1300 languages and one from a family like Eastern
> Trans-Fly with 4 languages. This means that you've sampled 0.0007% of
> Austronesian but 1/4 of ETF. This feels wrong."
>
> It doesn't feel wrong to me at all, just as it doesn't feel wrong to treat
> large languages like Russian in the same way as small languages like
> Sorbian. They have many more speakers, but these speakers are not
> independent of each other; in the same way, Austronesian speakers are not
> independent of each other, so a genealogically stratified sample would have
> only one Austronesian language (one that is at least 30 languages away from
> Papuan languages).
>
We can look at this problem from several perspectives. In my view, when we
talk about "unrelated" families, we just mean families between which we
cannot identify shared cognate material with enough certainty. It seems to
me unlikely, however, that language arose 200+ times in the history of the
human species, and so many language families (if not all of them) probably
have a common ancestor. Thus, even if you only pick language isolates for
your sample, you cannot claim complete genealogical independence.

Genealogical "independence" between languages is therefore a matter of
degree rather than of kind. In dated linguistic phylogenies, a measure of
the degree of divergence would be the age of separation: for instance a
2000-year-old family (Romance) vs. a 4000-year-old family (Indo-Iranian).
Unfortunately, these datings are currently difficult to compare across
linguistic phylogenies (and thus across language families), but eventually
I think this is the way to go. Unrelated languages could be assigned a
separation age > 12000 BP, for instance (maybe Simon has a better
suggestion here).

Best wishes,

Guillaume

-- 
Guillaume Jacques

Directeur de recherches
CNRS (CRLAO) - EPHE - INALCO
https://scholar.google.fr/citations?user=1XCp2-oAAAAJ&hl=fr
https://langsci-press.org/catalog/book/295
<http://cnrs.academia.edu/GuillaumeJacques>
http://panchr.hypotheses.org/