Computing phylogenies (was: Re: Indo-Hittite)

Fri Jan 28 17:14:37 UTC 2000

On Wed, 26 Jan 2000 X99Lynx at aol.com wrote:

> Since it is mainly based on prior scholarship and selected PIE
> reconstructions, I suspect there is nothing particularly novel about the
> UPenn tree findings.  It's clear that the methodology is originally designed
> to find the best possible internal consistency for particular theories,
> rather than testing the theories themselves.

*Sigh* You still haven't understood the significance of this work.  Let me
summarize again, partly for you and partly also for the benefit of those
who weren't on the list when we went over this last summer.

Computing a phylogeny over character-based data is a very difficult
computational problem.  Like the classic "travelling salesman" problem, it
belongs to the class of NP-complete problems (NP stands for
"nondeterministic polynomial time").  I could discuss exactly what this
means, but what it means in practical terms is that no known method is
guaranteed to find the exact answer without taking an impractical stretch
of time, such as twenty years or a million years, even with the fastest
computers.

If you have 10 taxa (species, or languages in this case), there are
34,459,425 possible rooted binary trees (17 x 15 x 13 x ... x 3 x 1)
representing the relations among those taxa.  For a given set of coded
characteristics for the 10 taxa ("is a vertebrate", "has a beak"), some of
these trees will score better than others.  The problem is to find the one
which scores the best, and the only sure way to do that is, in effect, to
compute the score for _every one_ of these trees.  This would just take
too long over any data set of larger than trivial size.
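
For anyone who wants to check the arithmetic, here is a minimal Python
sketch.  It assumes the figure above counts rooted, strictly binary trees
with labelled leaves, for which the count is the product of the odd
numbers from 3 up to 2n-3.  The function name is my own invention.

```python
def count_rooted_binary_trees(n_taxa: int) -> int:
    """Count rooted, binary, leaf-labelled trees on n_taxa taxa.

    The count is (2n-3)!! = 3 x 5 x ... x (2n-3); it is 1 for n = 1 or 2.
    """
    if n_taxa < 1:
        raise ValueError("need at least one taxon")
    count = 1
    # Odd factors 3, 5, ..., 2n-3 (the range is empty when n_taxa <= 2).
    for factor in range(3, 2 * n_taxa - 2, 2):
        count *= factor
    return count


if __name__ == "__main__":
    for n in (3, 4, 10):
        print(n, count_rooted_binary_trees(n))
```

For 10 taxa this reproduces the 34,459,425 quoted above, and each
additional taxon multiplies the count by the next odd number, which is why
exhaustive scoring stops being practical so quickly.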

This is too bad, because there are obvious applications for being able to
compute a phylogeny over character data.  The best known example is that
biologists would like to compute the evolutionary family tree for
biological species based on the characteristics of the species.

In the 1990s, M. Farach, S. Kannan, and T. Warnow worked out a way of
partly getting around the problem.  The mathematics of their algorithm are
beyond me, but as is often the case, you don't have to understand the
internals of an algorithm to be able to understand what it's computing and
to be able to use it. The practical characteristics of their algorithm are
as follows: 1) If the characters allow a perfect phylogeny, the algorithm
will return it.  2) If there is no perfect phylogeny, the algorithm will
return a pretty-good tree, but not one which is guaranteed to be the
best-scoring one out of all the possible trees.  However, since the
algorithm involves a random element, you can repeatedly run the algorithm,
and if the same tree keeps coming up, that's a good indication of the
tree's reliability.
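
To make "the characters allow a perfect phylogeny" a little more concrete,
here is a minimal Python sketch.  It is NOT the Farach/Kannan/Warnow
algorithm; it is only the textbook pairwise check, and it assumes every
character has just two states (0 or 1), which real linguistic characters
often do not.  For two-state characters, a perfect phylogeny exists
exactly when no pair of characters shows all four combinations 00, 01, 10,
and 11 across the taxa.  The taxa A-D and their codings are made up for
illustration.

```python
from itertools import combinations


def compatible(char_a, char_b):
    """Two binary characters clash only if all four state pairs occur."""
    return len(set(zip(char_a, char_b))) < 4


def admits_perfect_phylogeny(matrix):
    """matrix maps each taxon to a tuple of 0/1 character states.

    Assumes two-state characters only; for those, pairwise
    compatibility is enough to guarantee a perfect phylogeny.
    """
    # Transpose so that each tuple holds one character across all taxa.
    characters = list(zip(*matrix.values()))
    return all(compatible(a, b) for a, b in combinations(characters, 2))


# Hypothetical codings: three characters over four taxa, no conflicts.
tree_like = {"A": (1, 1, 0), "B": (1, 1, 0), "C": (1, 0, 1), "D": (0, 0, 1)}
# Hypothetical codings: two characters showing all four combinations.
conflicting = {"A": (1, 1), "B": (1, 0), "C": (0, 1), "D": (0, 0)}

print(admits_perfect_phylogeny(tree_like))    # True
print(admits_perfect_phylogeny(conflicting))  # False
```

When the check fails, as in the second data set, no tree fits every
character without conflict, and that is the situation in which one has to
settle for a pretty-good tree.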

Don Ringe and Ann Taylor (both Indo-Europeanists) got together with Tandy
Warnow, and the team had the bright idea of using this algorithm, which
was originally developed for biologists, to compute the phylogeny of the
IE language family.  This had never been done before.  Because the
technology to do this didn't previously exist, it _couldn't_ have been
done before. Your statement that "there is nothing particularly novel"
about this work is simply a misunderstanding about what has been done
here.

The first thing the team had to do was to come up with a data set encoding
various characteristics of the IE languages.  You state that the team's
work is "mainly based on prior scholarship". It's quite true that the team
drew on the collective knowledge of the IE scholarly community in coming
up with the character list, much as a biologist might refer to
already-published descriptions of various species in coming up with a
character list for the purpose of computing the evolutionary family tree
of those species.  It's obvious that they should do so; they are not
working in a vacuum, and it would be perverse to ignore what we already
know.

As for your statement "It's clear that the methodology is originally
designed to find the best possible internal consistency for particular
theories, rather than testing the theories themselves", I'm not sure what
you're saying here.  Perhaps you mean that if one chose a different
character set, the algorithm might compute a different optimal tree.
This is true, but as I've said before: another competent Indo-Europeanist
might come up with a slightly different character set, but it could not be
very different.  There is only so much data to work with, and many of the
interpretations of the data are non-controversial.  Still, if you feel
that the character set could be improved, you are free to revise it and to
set your results against those of Ringe, Warnow, and Taylor.  This would
be a very good thing to do.

> As you say, adjustments were made in the data to give a relative chronology
> to the tree afterwards.

This looks like a garbled version of what the team actually did. _After_
the team produced the unrooted phylogeny and made a claim for where the
root should be, they made a diagram showing the century where each
language was first attested, and plotted the phylogeny against those
dates.  This involves graphically stretching the tree out a bit, but it
doesn't involve any changes to the actual structure of the tree. Producing
a diagram of this sort allows us to make some statements about absolute
dates before which particular branchings must have occurred.

For example, the team claims that the nearest relative of Greek is
Armenian.  Since a clearly differentiated Greek is attested in the middle
of the second millennium BCE, it must be the case that the Greco-Armenian
branching took place _before_ that date.
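
The reasoning generalizes: a branching can be no later than the earliest
first attestation among the languages descended from it.  Here is a
minimal Python sketch of that rule, not the team's actual procedure.  The
nested-tuple tree, the function name, and the taxon "Z" are hypothetical,
and the years are rough placeholders (negative means BCE): -1500 stands in
for "the middle of the second millennium BCE", while the Armenian and "Z"
figures are my own assumptions for illustration.

```python
def latest_split_dates(tree, first_attested):
    """Map each internal node (as a frozenset of its leaves) to the
    latest year by which that branching must already have happened."""
    bounds = {}

    def visit(node):
        if isinstance(node, str):          # a leaf: a single language
            return frozenset([node]), first_attested[node]
        leaves, earliest = frozenset(), float("inf")
        for child in node:                 # an internal node: a tuple
            child_leaves, child_earliest = visit(child)
            leaves |= child_leaves
            earliest = min(earliest, child_earliest)
        bounds[leaves] = earliest
        return leaves, earliest

    visit(tree)
    return bounds


# Placeholder data; only the Greek date reflects the text above.
first_attested = {"Greek": -1500, "Armenian": 400, "Z": -300}
tree = (("Greek", "Armenian"), "Z")

bounds = latest_split_dates(tree, first_attested)
print(bounds[frozenset({"Greek", "Armenian"})])  # -1500
```

Note that the late attestation of Armenian contributes nothing here; the
bound on the Greco-Armenian split comes entirely from Greek.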

> Some of these adjustments were geographical and relate to presumed
> contact or lack of it.

They made no such adjustments.  The phylogeny was computed strictly based
on what is actually found in the languages in question, without regard to
geography.

> There is no indication that the UPenn group ever hypothesized or
> attempted to execute a tree where Hittite and PIE were hypothesized to
> be sisters decended from a common ancestor to see if it was also
> consistent with the data.

First, you're misunderstanding their methodology if you believe that they
would "attempt to execute" a particular tree.  They feed the character
data into the algorithm and take the tree that comes out.

Second, the tree you describe is in fact the phylogeny which the team
arrived at: they do believe that Anatolian is a sister of
Proto-Everything-Else, which you refer to as "PIE" (and Steven, I am
determined to ignore into the ground any objections over the labeling of
the nodes).

> Such a procedure might have been methodologically necessary to
> properly 'test' the IndoHittite hypothesis.  These findings also might
> have carried assumptions that Hittite was a descendent of PIE (e.g.,
> some of the established reconstructed PIE that Ringe, et al used to
> identify cognates for co-categorization presumably were reconstructed
> with Hittite included in the comparative data) and the computer
> confirmed that this was not inconsistent - a finding that could have
> also possibly come out of hypothesizing Hittite as being a sister
> language.

Let me say it again: the phylogeny which the team arrived at is as
follows:

        X
       / \
      /   \
     Y    Anatolian

I'm referring to X as "PIE" and to Y as "Proto-Everything-Else".  You're
referring to Y as "PIE" (so I guess X would be "Proto-Indo-Hittite" in
your nomenclature).  From the perspective of the team's methodology, the
two claims are identical and indistinguishable.  Again, this is purely a
matter of what arbitrary labels we choose for the nodes, and I am
determined to ignore any argument about it.

Also, let me respond to someone else:

> In a message dated 1/26/00 11:23:23 AM, Hans J Holm wrote:

> <<It must be noticed that the starting point or so called 'root' in the
> Ringe/Warnow tree is /not/ calculated, but inserted by Ringe as outcome of
> traditional, mainstream views perhaps too much preoccupied by the only
> early documentation of Hittite. >>

(Responding to Hans Holm): Yes, it's quite true that the algorithm that
Ringe, Warnow, and Taylor used produces an unrooted phylogeny.  It would
be a possible and coherent point of view to accept their unrooted
phylogeny but to argue for some other point in the phylogeny as the root.

  \/ __ __    _\_     --Sean Crist  (kurisuto at unagi.cis.upenn.edu)
 ---  |  |    \ /     http://www.ling.upenn.edu/~kurisuto/
  _| ,| ,|   -----
  _| ,| ,|    [_]
   |  |  |    [_]