<div dir="ltr"><div dir="ltr">Dear Martin,</div><br><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div><p>As you say, the reliability of these studies hinges on the
cognate coding, which is done manually, by humans with their
biases. I'm wondering if there is a way to measure the degree to
which different linguists agree or not (by some kind of kappa
statistic), and a way to identify or exclude systematic biases
(which are part of normal human behaviour). </p></div></blockquote><div>This has not been done systematically in the past, but inter-annotator agreement should indeed be measured in future research on phylogenies (when large teams are involved).</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div><p>Another thing that I
worry about is that grammatical markers (even demonstratives and
interrogatives) are ignored (see the list of 170 comparison
meanings in IE-COR: <a href="https://iecor.clld.org/parameters" target="_blank">https://iecor.clld.org/parameters</a>), even
though we know that these are the most resistant to borrowing.
</p></div></blockquote><div>Demonstratives and interrogatives are part of Swadesh's original 200-word list, and some studies include them (for instance our <a href="https://www.pnas.org/doi/10.1073/pnas.1817972116">2019 Sino-Tibetan phylogeny</a>). In most families, it is actually surprisingly difficult to deal with these forms. The problem is that the phylogenetic methods that we use require the etyma in the various lexical meanings to be independent of each other: this is why the word lists must be tailored to each family, not to maximize the number of cognates, but to avoid colexifications within the same list (between "son" and "boy", or between "sun" and "day", etc.). In the case of interrogatives, in a language where all interrogative pronouns share a common root (like wh- words in English), including more than one of them would count the same etymon several times, so you really don't want more than one in your list.</div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div><p> Especially in closely related languages, it's very hard to
distinguish lexical loanwords from inherited words, isn't it? (For
example, Dutch begrijpen 'understand' is said to have been
borrowed from German <a href="https://wold.clld.org/word/72181920155924122" target="_blank">https://wold.clld.org/word/72181920155924122</a>,
but without the rich attestation of both languages since the
Middle Ages, we wouldn't be able to tell.)</p></div></blockquote><div>Phylogenetic methods don't do miracles: your phylogeny is only going to be as reliable as your knowledge of the history of each language. If we want better phylogenies, we first and foremost need more fieldwork and philological research, more work on etymologies in all language families, and more specialists in this field. This is why historical linguists should all embrace these methods ... 😀</div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div>
<p><span style="white-space-collapse: preserve;">"But then what does this mean when you take one language from a family like Austronesian with ~1300 languages and a one from a family like Eastern Trans-Fly with 4 languages. This means that you've sampled 0.0007% of Austronesian but 1/4 of ETF. This feels wrong."</span><br></p>
<p><span style="white-space:pre-wrap">It doesn't feel wrong to me at all, just as it doesn't feel wrong to treat large languages like Russian in the same way as small languages like Sorbian. They have many more speakers, but these speakers are not independent of each other; in the same way, Austronesian speakers are not independent of each other, so a genealogically stratified sample would have only one Austronesian language (one that is at least 30 languages away from Papuan languages).</span></p></div></blockquote><div>We can see this problem from several perspectives. In my view, when we talk about "unrelated" families, we just mean families between which we cannot identify shared cognate material with enough certainty. It seems to me unlikely that language arose 200+ times in the history of the human species, so many language families (if not all of them) probably have a common ancestor. Thus, even if you only pick language isolates in your sample, you cannot claim complete genealogical independence. </div><div><br></div><div>Genealogical "independence" between languages is therefore a matter of degree rather than of kind. In dated linguistic phylogenies, a measure of the degree of divergence would be the age of separation: for instance a 2000-year-old family (Romance) vs a 4000-year-old family (Indo-Iranian). Unfortunately, at present these datings are difficult to compare across linguistic phylogenies (and thus across language families), but eventually I think this is the way to go. 
Unrelated languages could be assigned a separation age > 12000 BP for instance (maybe Simon has a better suggestion here).</div><div><br></div><div>Best wishes,</div><div><br></div><div>Guillaume</div><div><br></div></div><span class="gmail_signature_prefix">-- </span><br><div dir="ltr" class="gmail_signature"><div dir="ltr"><div>Guillaume Jacques</div><div><br></div><div>Directeur de recherches<br>CNRS (CRLAO) - EPHE- INALCO <br></div><div><a href="https://scholar.google.fr/citations?user=1XCp2-oAAAAJ&hl=fr" target="_blank">https://scholar.google.fr/citations?user=1XCp2-oAAAAJ&hl=fr</a><br></div><div><a href="http://cnrs.academia.edu/GuillaumeJacques" target="_blank">https://langsci-press.org/catalog/book/295</a></div><div><div><a href="http://panchr.hypotheses.org/" target="_blank">http://panchr.hypotheses.org/</a></div></div></div></div></div>