Assumptions in Computing phylogenies

Fri Feb 25 06:21:40 UTC 2000

At 11:46 AM 2/15/00 -0500, ECOLING at aol.com wrote:

>According to one of our correspondents (Stanley Friesen?),
>the biologists have found that this (?) technique is not highly
>robust, is subject to artifact effects in several ways,
>and that the UPenn team have not taken account of these.
>...
>Here are two such claims I think I have seen about artifact effects:

>(a) results are highly sensitive to the choice of initial characteristics

This is true, and some doubt about their selection of characters has been
expressed.  However, I am not sure how important that is in this case.

>(b) results may be systematically biased by the technique
>     (what someone referred to as the "long branch" attraction effect,
>     if I remember correctly)

Actually, these are two different things.  Long branch attraction is a
general problem with deeply branched sparse trees, across virtually all
known techniques.  Biologists are still struggling with this one, with no
final answers yet.  Basically, it is hard to recover the *true* branching
order for deep branches in the absence of very early sub-branches.  (This
is probably one of the reasons why the relationships of the animal phyla
are so hard to determine).

But the technique itself can also introduce bias.  (And I have some issues
even with the more common techniques used in biology.)

>(c) are results sensitive to whether a dialect in a dialect net
>is near the center, surrounded by closely related languages,
>with many nearby characteristics to compare,
>or near the periphery, surrounded by unrelated languages or isolated,
>with fewer nearby characteristics to compare?
>Will these different positions influence results expressed as trees
>in ways they should not?  (That is to say, peripheral dialects
>may split off or innovate earlier; or they may fail to follow innovations
>spreading from another part of the dialect network; two quite opposite
>possibilities.  Is the technique biased in these respects?)

This could actually be considered a special case of the same basic problem
as long branch attraction.

>But results *do* quite properly depend crucially both on the choice of
>characteristics included in the data and on the interpretation of
>those characteristics, both in prior scholarship.
>So there is a sense in which results are partly built in by the selection
>of characteristics and the interpretation as innovations vs. retentions.

Actually, properly done, cladistic analysis *determines* which characters
are innovations and which are retentions.  That is one of its real powers.
The distribution of the characters determines the tree topology, with the
character transitions placed on the branches between nodes.  Then one uses
some method to determine the root of the tree.  Now, each transition is an
innovation as one moves away from the root.  Any character state with no
transition between it and the root is a retention.
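
As a minimal sketch of that last step (an illustration only, in Python, with
hypothetical language names A-D, hypothetical internal nodes x and y, and
made-up states for a single character; it is not the UPenn team's procedure):
once a tree topology and a root are given, reading off the innovations is
just a walk away from the root.

```python
from collections import deque

def innovations(edges, states, root):
    """Orient an unrooted tree at `root` and list character changes.

    edges  : list of (node, node) pairs forming an unrooted tree
    states : dict node -> state of ONE character (internal nodes included)
    Returns (parent, child, old_state, new_state) for every branch on
    which the state changes when moving away from the root.
    """
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, []).append(b)
        neighbours.setdefault(b, []).append(a)

    changes = []
    seen = {root}
    queue = deque([root])
    while queue:
        parent = queue.popleft()
        for child in neighbours[parent]:
            if child in seen:
                continue
            seen.add(child)
            queue.append(child)
            if states[child] != states[parent]:
                changes.append((parent, child, states[parent], states[child]))
    return changes

# Hypothetical tree: (A, B) -- x -- y -- (C, D), one binary character.
edges = [("A", "x"), ("B", "x"), ("x", "y"), ("y", "C"), ("y", "D")]
states = {"A": 1, "B": 1, "x": 1, "y": 0, "C": 0, "D": 0}

print(innovations(edges, states, root="y"))  # state 1 is the innovation
print(innovations(edges, states, root="x"))  # state 0 is the innovation
```

The same unrooted tree gives opposite answers depending on where the root is
put, which is why the choice of root matters so much.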

Actually, this reminds me of another problem with the UPenn tree!  They
effectively *assume* the placement of the root.  They do not use any of the
various established means for locating the root.  Unfortunately the most
powerful method, outgroup comparison, is not available in linguistics at
this time depth (unless one accepts a relationship of PIE with the Uralic
languages or some such thing).
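
For readers who have not met outgroup comparison, a rough sketch in Python
follows.  It assumes one already has an unrooted tree and a taxon known to
lie outside the group of interest; the names "Out", A, B, C and the internal
nodes x, y are hypothetical.  The root is simply placed on the branch that
joins the outgroup to everything else.

```python
def root_with_outgroup(edges, outgroup):
    """Root an unrooted tree on the branch leading to the outgroup leaf.

    edges    : list of (node, node) pairs forming an unrooted tree
    outgroup : name of a leaf assumed to lie outside the ingroup
    Returns the rooted tree as nested tuples: (outgroup, ingroup_subtree).
    """
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, []).append(b)
        neighbours.setdefault(b, []).append(a)

    # A leaf has exactly one neighbour; unpacking fails otherwise.
    (attach,) = neighbours[outgroup]

    def subtree(node, came_from):
        kids = [subtree(n, node) for n in neighbours[node] if n != came_from]
        return tuple(kids) if kids else node

    return (outgroup, subtree(attach, outgroup))

# Hypothetical unrooted tree: (A, B) -- x -- y -- (C, Out)
edges = [("A", "x"), ("B", "x"), ("x", "y"), ("y", "C"), ("y", "Out")]
print(root_with_outgroup(edges, "Out"))
```

Without an agreed outgroup there is nothing to pass as the second argument,
which is exactly the difficulty at this time depth.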

[The rooting problem is why many biological trees based on gene analysis or
protein comparison are presented in "rootless" (unrooted) form, as seen in
the recent article in Scientific American on the relationships of the
various eukaryotic and prokaryotic groups.]

--------------
May the peace of God be with you.         sarima at ix.netcom.com