Assumptions in Computing phylogenies

Tue Feb 15 16:46:06 UTC 2000

Thanks to Sean Crist for once again clarifying what the UPenn group
is doing.
I believe some further clarifications are in order about possible questions
which may be raised, without claiming to be an expert on this.
I may have gotten something quite wrong.

I am copying this message to Sean Crist because a later message
from him says that he will not be receiving IndoEuropean list
messages automatically from now on for a while.

I agree with Sean on terminology, on the automaticity of the
algorithm, and on the point that if the algorithm contains a random
element, then getting the same unrooted tree again and again indicates
some stability of the tree.  None of that is in dispute here.
I am more concerned with problems raised by specialists in
biological classification about the technique itself,
without implying we need to take everything from biologists.
(See other recent discussions.)  I am also glad the UPenn team
is doing this (computers can help us), while remaining
skeptical about assumptions getting hidden when computers
are used.  There are simply so many examples of this happening,
of fooling oneself unintentionally with statistics, and so on.
Anyhow, onwards:

[SC]
>For a given set of coded characteristics for the 10
>taxa ("is a vertebrate", "has a beak"), some of these trees will score
>better than others.  The problem is to find the one which scores the best,
>and the only way to compute the problem deterministically is to compute
>the score for _every one_ of these trees.

Even computing it deterministically might not give the best answer.

An important question is the degree to which using the measure
"scores the best" will in fact yield the closest to the TRUE tree
of the historical splits which occurred in the history of PIE
and its descendants.
That is to say, the method and the "scores" used must be *evaluated*
against their success judged by other means.
They are not themselves the judge of other considerations.
They may help us to gain more insight,
and a computer can handle much more computation than we can
do by hand, so even given that the full computation
Sean Crist refers to would take too long to actually carry out,

>This would just take too long
>over any data sets of larger than trivial size.

the technique can still be useful.  But the question remains,
how closely does the "scoring" system favor a TRUE tree.
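
To give a sense of why scoring every tree is impractical, here is a
small Python sketch that counts the unrooted, fully bifurcating trees
on n labelled taxa, using the standard double-factorial formula
(2n - 5)!!.  The function name is my own, chosen for illustration;
nothing here is taken from the UPenn team's software.

```python
def count_unrooted_trees(n_taxa: int) -> int:
    """Number of distinct unrooted binary trees on n_taxa labelled leaves.

    Uses the standard formula (2n - 5)!! for n >= 3.
    """
    if n_taxa < 3:
        return 1  # with one or two taxa there is only one possible shape
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n - 5.
    for k in range(3, 2 * n_taxa - 4, 2):
        count *= k
    return count


if __name__ == "__main__":
    for n in (4, 10, 20):
        print(n, count_unrooted_trees(n))
```

For the 10 taxa of Sean's example this already gives about two
million trees, and the count grows faster than exponentially as taxa
are added.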

I am *not here* questioning whether a dialect network is a better
model than a tree; that is a separate question.  I am here only
concerned with whether the tree produced by *this* technique
is the best *tree* possible.  The "least strain" on the model
may properly correspond to the best score if the scoring system
is designed ideally.
The result of this technique is by definition not a dialect net.

I would be happier if we had a technique that could give results
as a combination of dialect net and tree, and I assume that if we had
such a technique, it could handle Germanic better, both as to its
place within IE and as to the dialect divisions within it.
But such a new technique might, in some versions,
make fewer suggestions about tree splits,
simply avoiding them in difficult cases.
That might be good sometimes (i.e. when the result was true),
but bad other times (i.e. when not true).

[SC notes the impossibility of doing complete deterministic computations,
and that]

>In the 1990's, M. Farach, S. Kannan, and T. Warnow worked out
>a way of partly getting around the problem.
>The mathematics of their algorithm are beyond me,
>but as is often the case, you don't have to understand the
>internals of an algorithm to be able to understand
>what it's computing and to be able to use it.
>The practical characteristics of their algorithm are as follows:
>1) If the characters allow a perfect phylogeny, the algorithm
>will return it.  2) If there is no perfect phylogeny, the algorithm will
>return a pretty-good tree, but not one which is guaranteed to be the
>best-scoring one out of all the possible trees.  However, since the
>algorithm involves a random element,
>you can repeatedly run the algorithm, and if the same tree
>keeps coming up, that's a good indication of the tree's reliability.

Then
>Don Ringe and Ann Taylor (both Indo-Europeanists)
>got together with Tandy Warnow,
and applied this method to the family tree of the IE languages.
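
The notion of a "perfect phylogeny" can be made concrete for the
simplest case, that of two-state characters.  The sketch below is
*not* the Farach/Kannan/Warnow algorithm, whose internals I do not
know; it is only the classic pairwise ("four-gamete") compatibility
test.  For two-state characters, a standard result is that a perfect
phylogeny exists exactly when every pair of characters passes this
test; that does not carry over to multistate characters, which
linguistic data often require.  The function names and the sample
data are invented for illustration.

```python
from itertools import combinations


def pair_is_compatible(char_a, char_b):
    """Four-gamete test for two binary characters.

    Each character is a sequence of 0/1 states, one entry per taxon.
    The pair can sit on one tree without parallel change or reversal
    unless all four combinations (0,0), (0,1), (1,0), (1,1) occur.
    """
    return len(set(zip(char_a, char_b))) < 4


def allows_perfect_phylogeny(characters):
    """True if every pair of binary characters passes the four-gamete test."""
    return all(pair_is_compatible(a, b) for a, b in combinations(characters, 2))


if __name__ == "__main__":
    # Invented codings for five hypothetical taxa.
    c1 = [1, 1, 0, 0, 0]
    c2 = [0, 0, 1, 1, 0]
    c3 = [1, 0, 0, 0, 0]
    c4 = [1, 0, 1, 0, 0]  # conflicts with c1
    print(allows_perfect_phylogeny([c1, c2, c3]))
    print(allows_perfect_phylogeny([c1, c2, c3, c4]))
```

When a single conflicting character such as c4 is added, no tree fits
all the characters, which is the situation in which the quoted text
says only a "pretty-good" tree can be returned.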

***

Now on to my primary questions:

According to one of our correspondents (Stanley Friesen?),
the biologists have found that this (?) technique is not highly
robust, is subject to artifact effects in several ways,
and that the UPenn team have not taken account of these.
I do not know enough to say whether they have
or have not, but I believe these questions should be
addressed publicly and clearly.
Perhaps they have been, in which case I will appreciate
being referred to sources.  Perhaps these are among the issues
to be more fully explained in a publication in preparation,
in which case I will just have to wait, though some sketchy
explanations in advance of publication would be helpful.
Here are two such claims I think I have seen about artifact effects:

(a) results are highly sensitive to the choice of initial characteristics
(b) results may be systematically biased by the technique
     (what someone referred to as the "long branch" attraction effect,
     if I remember correctly)

And here is a third one I raised recently as a question;
I do not think the one response I received advanced
my understanding:

(c) are results sensitive to whether a dialect in a dialect net
is near the center, surrounded by closely related languages,
with many nearby characteristics to compare,
or near the periphery, surrounded by unrelated languages or isolated,
with fewer nearby characteristics to compare?
Will these different positions influence results expressed as trees
in ways they should not?  (That is to say, peripheral dialects
may split off or innovate earlier; or they may fail to follow innovations
spreading from another part of the dialect network; two quite opposite
possibilities.  Is the technique biased in these respects?)
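
Claim (a) at least can be probed mechanically.  The sketch below is a
toy leave-one-character-out check: it drops each character in turn
and records which pair of taxa then comes out as closest.  It uses a
simple count of differing characters, which is an assumption of mine
and *not* the character-based method the UPenn team uses; the matrix
and all names are invented.  If the closest pair changes when a
single character is removed, the grouping rests on very little.

```python
from collections import Counter
from itertools import combinations


def hamming(row_a, row_b, skip=None):
    """Count positions (other than `skip`) where two taxa differ."""
    return sum(
        1 for i, (a, b) in enumerate(zip(row_a, row_b)) if i != skip and a != b
    )


def closest_pair(matrix, skip=None):
    """Pair of taxon names with the smallest distance.

    Ties are broken by the alphabetical order of the pair.
    """
    return min(
        combinations(sorted(matrix), 2),
        key=lambda pair: (hamming(matrix[pair[0]], matrix[pair[1]], skip), pair),
    )


def jackknife_closest_pairs(matrix):
    """Drop each character in turn; tally which pair comes out closest."""
    n_chars = len(next(iter(matrix.values())))
    return Counter(closest_pair(matrix, skip=i) for i in range(n_chars))


if __name__ == "__main__":
    # Invented 0/1 codings: four hypothetical taxa, five characters.
    matrix = {
        "A": [1, 1, 0, 0, 1],
        "B": [1, 1, 0, 1, 0],
        "C": [0, 1, 1, 1, 0],
        "D": [0, 0, 1, 0, 1],
    }
    for pair, times in jackknife_closest_pairs(matrix).most_common():
        print(pair, times)
```

The same resampling idea could in principle be wrapped around any
tree-building procedure, with the comparison made on whole trees
rather than on a single closest pair.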

***

Here is Sean's response to a different question, but relevant to (a):

>...the team's work is "mainly based on prior scholarship".
>It's quite true that the team
>drew on the collective knowledge of the IE scholarly community in
>coming up with the character list, much as a biologist might refer to
>already-published descriptions of various species in coming up with a
>character list for the purpose of computing the evolutionary family tree
>of those species.  It's obvious that they should do so; they are not
>working in a vacuum, and it would be perverse to ignore what we already
>know.

My own observation  (at a lecture by Ringe at the Smithsonian Institution
some years ago) was that Ringe expressed "surprise" that the
results of using the technique were highly consistent with traditional
scholarship.  I found that expression of surprise itself surprising,
since agreement is exactly what one would expect if traditional
comparativists had done their job decently and if the UPenn team had
done their job decently.  But it also made me wonder why Ringe was so
strongly emphasizing the superiority of the UPenn technique as compared
with previous work.  Perhaps everyone simply tends to view their own
work as important.  I am glad the UPenn team is doing this, and am
certain it will at least raise questions which may have been overlooked,
and by virtue of using a computer may be able to check some hypotheses
which were not previously checked.  See the next section below.

But results *do* quite properly depend crucially both on the choice of
characteristics included in the data and on the interpretation of
those characteristics, and both come from prior scholarship.
So there is a sense in which results are partly built in by the selection
of characteristics and the interpretation as innovations vs. retentions.
This will seem quite proper if one agrees with the conclusions built in,
and not if one does not agree with some of them.  Presumably
traditional scholarship has done its work well, but in that case the
results of the UPenn technique really do depend in essential ways
on traditional scholarship: the technique cannot question those
earlier results, which it treats simply as facts, as data.

***

Where does the UPenn work fit in a longer view of development
of our field?  Here is how I see it, as one step in a long chain of steps
(as so far presented).

Personally, I do think that in order to evaluate whether we think
the results of this technique are TRUE (valid, not merely repeatable),
we will need a more advanced method which can yield results as a
mixture of dialect network and tree structure, at least that,
and we will need a much larger amount of data,
enough data so that a traditional comparative linguist would be able
to identify fairly easily by scanning the data tables of actual forms
cases in which there is borrowing or areal influences rather than
family-tree phenomena, and the reverse.

At that point I would begin to have some confidence that the technique
can be applied to more difficult cases with less than optimal
quantities of data, and we could begin to measure how reliable the
results are when using various kinds of subsets of data rather than complete data,
then even deriving probability estimates for cases in which we CANNOT
have ideal complete data because the time depth is too great.

The other area where I think computerization can be most helpful
is in developing automatic techniques for detecting assumptions
which we need to question.  Such assumptions usually appear obvious
after the fact, so we forget them.  A good historian of IE studies
would know of many trails of analysis which turned out wrong,
which could be added here.

Consider these two, each embodying the kind of assumption
that is often built in unconsciously, perhaps the kind of assumption
to which some of our correspondents might be referring.
I mention these two simply because they have been of interest to me.

If our grouping (say in a dialect net) of gradient centum-satem
characteristics relied on an assumption that
*k' > c^ > ts > "th",
the computer might suggest trying an alternative assumption
*k' > ts > {c^ or "th"}.
I asked about this in a separate message yesterday,
giving phonetic analysis to suggest the second is more realistic,
more common, a simpler explanation, whatever.
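
How much such an assumption matters can be shown with a small
sketch.  It encodes the two assumed pathways as graphs of states and
counts the fewest changes separating any two reflexes under each.
Treating the links as undirected is a simplification of mine (sound
changes have a direction), and the function names are invented; the
state labels are the ones used above.

```python
from collections import deque


def undirected(edges):
    """Build an adjacency map from a list of (state, state) links."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def steps_between(state_graph, start, goal):
    """Breadth-first search: fewest changes linking two states."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        state, dist = queue.popleft()
        if state == goal:
            return dist
        for nxt in state_graph[state]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # the two states are not connected


# First assumption:  *k' > c^ > ts > "th"
CHAIN = undirected([("k'", "c^"), ("c^", "ts"), ("ts", "th")])
# Alternative:       *k' > ts > {c^ or "th"}
BRANCHING = undirected([("k'", "ts"), ("ts", "c^"), ("ts", "th")])

if __name__ == "__main__":
    for a, b in [("k'", "c^"), ("k'", "ts"), ("k'", "th"), ("c^", "th")]:
        print(a, b, steps_between(CHAIN, a, b), steps_between(BRANCHING, a, b))
```

Under the first assumption a language showing "th" is three changes
away from *k'; under the second it is only two, and c^ is no longer
an intermediate stage.  Any scoring that counts changes would weigh
the same data differently depending on which pathway was built in.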

Or, for the Chinese languages, I once read Karlgren's detailed work
very carefully, and noticed a great rotation of the vowel space
was used as the sound correspondence between two sets of
languages in the Chinese family.  Karlgren took one form of
the vowel space as more original, a second form as derived
by rotation of the vowel space.  What if we reversed that,
took the second form as more original, the first form as derived
by a rotation of the vowel space in the opposite direction?
How would that affect the family tree or the dialect network
for Chinese languages?  (There may be good reasons to reject
such an alternative assumption, but the question can at least be
asked explicitly, and a computer might help to force such
assumptions to our conscious level.)

***

One of our correspondents has mentioned that the traditional
neogrammarians in fact checked their new hypotheses against
diagrams in which the sound changes and the vocabulary items
affected (?) were displayed on the branches (?) of trees.
I would like some detailed references (title, author, *page* in volumes,
etc.) to see examples of these.

Perhaps, despite our much greater knowledge today,
this would still be useful, so that we recognize that our knowledge
is always in progress, that we keep its logic always available
to us on the surface as much as our display techniques
allow us to do so,
rather than visually displaying only results without the ability
to delve into the reasoning and question it afresh,
when either new data *or* new perspectives come along.

***

Best wishes,
Lloyd Anderson
Ecological Linguistics
PO Box 15156
Washington, DC 20003