Contributions by Steve Long

Sean Crist kurisuto at unagi.cis.upenn.edu
Thu Sep 30 16:19:30 UTC 1999


Your post is so very long that I am only going to respond to a few major
points.  It's really hard to wade thru a post when it gets this long.

On Mon, 27 Sep 1999 ECOLING at aol.com wrote:

> Point 1.

> Family Tree representations can be much improved.

> Steve Long's contributions have shown that the types of family tree
> diagrams which we are accustomed to using in our tradition DO NOT
> adequately represent even the locations of innovations on which such
> trees are by all accounts theoretically based, not even whether
> innovations occur or do not occur on particular branches!  I have noted
> details in a previous message, suggesting what we might do about this.
> I will certainly think twice before producing or reproducing further
> family trees which do not show such information!

You can certainly take a rooted language phylogeny and find some way to
graphically represent the innovations which have occurred on each branch.
A caution, tho: if you really want to show every tiny innovation (all the
way down to neologisms coined within a branch, specific analogical changes
within paradigms, etc.), you're going to need a sheet of paper the size of
Texas.

If you want, you can just include the "best hits" of the innovations in
each branch, but at this point, your representation is already an
abstraction.  This is what a phylogeny is: an abstraction over a mass of
data which allows us to see the big picture.

> Point 2.

> Parent and daughter languages can indeed in theory co-exist,
> exactly as Steve Long said.

> I use here the definition of distinctness of "languages" preferred by
> most linguists, including the experts on this list, that is, fuzzily,
> "forms of speech which are mutually unintelligible".

If you want to define things in those terms, you're free to do so.
At this point, however, we're no longer speaking on the same subject,
since the tree I cited is concerned with shared or unshared innovations,
not with mutual intelligibility, which is a different matter.  I stick by
my claim that the following situation cannot exist among natively learned
languages:

          A
         / \
        A   B

where the right branch has innovated and the left has undergone no change
whatever.  This is true whether A and B are mutually intelligible or not.
I am not interested in debating this point any further.

> Point 3.

> Computer algorithms may have their conclusions partly built into the data.

> Steve Long's contributions have led to acknowledgements in the most
> recent messages, when the "characters" were actually specified for the
> first time within these discussions, that the "characters" used in the
> UPenn tree for the IE languages include in several cases specifications
> of innovations vs. retentions, and thus may build into the data a part
> of the conclusions, or at least strong biases towards them.

It is absolutely not the case that the input to the algorithm included
information about what was an innovation and what was a retention.  I'm
surprised that there is still misunderstanding on this point.  The input
comprises codings of particular characteristics of the languages, and
nothing more.

> That was
> one of Steve's early assertions, and in that he has been shown to be at
> least partly correct.  If this had been acknowledged forthrightly from
> early on, if the "characters" had actually been listed in the discussion
> then, we would have had far fewer words expended, and people would have
> become less frustrated with each other.

When I first cited the Ringe-Warnow-Taylor phylogeny on 8 Aug 1999, I said
the following:

  [...] They used an algorithm developed to produce optimal phylogenies of
  biological species (e.g.  you might code "vertebrate" as "1" and
  "non-vertebrate" as "2", etc.  The computational problem, which is quite
  difficult, is to compute the correct phylogeny, or family tree, for the
  species being considered).  What was new in their approach was to use
  this methodology to produce a phylogeny of a family of languages (e.g.
  perhaps you would code Indo-Iranian with a "1" to mean "undergoes the
  RUKI rule", and Italic with a "2" to mean "doesn't undergo RUKI"). [...]

So from my very first post, I had explained what kind of input the program
took.  Apparently my explanation was too abstruse for some, so I clarified
it later.
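
To make the point about the input concrete, here is a minimal sketch of what
a character coding carries.  It assumes, as described above, that the states
are arbitrary unordered labels.  The helper name `partition` and the
four-language coding are illustrative assumptions, not the team's actual
data or code.  Swapping the labels "1" and "2" leaves the grouping of
languages unchanged, so the labels themselves say nothing about which state
is the innovation and which is the retention.

```python
def partition(character):
    """Group languages by shared state, discarding the state labels themselves."""
    groups = {}
    for language, state in character.items():
        groups.setdefault(state, set()).add(language)
    return {frozenset(members) for members in groups.values()}


# Illustrative coding only (not the team's actual data):
# 1 = "undergoes the RUKI rule", 2 = "doesn't undergo RUKI".
ruki = {"Indo-Iranian": 1, "Balto-Slavic": 1, "Italic": 2, "Celtic": 2}

# Swap the two labels: every 1 becomes 2 and every 2 becomes 1.
relabeled = {language: 3 - state for language, state in ruki.items()}

# The grouping of languages is identical under either labeling.
print(partition(ruki) == partition(relabeled))
```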


> (That is not to say the judgements of innovations vs. retentions
> included in any of those "characters" are wrong, simply that they are
> already there, they are not generated independently as the result of the
> computer algorithm.  It can be a very difficult task to investigate
> whether there has in effect been any distorting bias resulting from
> this, but the potential is clearly there on the surface.)

> Given this point, it becomes much more difficult to determine what it is
> that the UPenn algorithm has contributed beyond the results of decades
> of work by very accomplished traditional experts in Indo-European
> linguistics.

When you have 12 different species or languages, there is a total of
34,459,425 different possible phylogenies characterizing the relations
between them.

Texts on IE have traditionally represented the IE family as a star-shaped
phylogeny, i.e. with a single PIE node at the top, and all of the major
branches as daughters of this single node.  That's one of the 34,459,425
possible phylogenies.  Nobody ever really believed that this was the
correct phylogeny (i.e., we don't think that the PIE speakers packed up
and set off in a dozen different directions at the same time), but the
problem of determining which of the 34,459,424 alternatives is the right
one is so computationally difficult that until now we have not had the
technical ability to solve it.
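
Counts of this kind can be checked with the standard formulas for fully
resolved (binary) trees: (2n-3)!! rooted trees and (2n-5)!! unrooted trees
on n labeled leaves.  The sketch below uses hypothetical function names.  By
these formulas, 34,459,425 is the count of rooted binary trees on 10 leaves
(equivalently, unrooted ones on 11); for 12 leaves the counts are larger
still.  These formulas cover binary trees only, so a star-shaped tree is not
among the trees they count.

```python
def odd_double_factorial(k):
    """Product of the odd numbers 1 * 3 * ... * k (1 if k < 1)."""
    result = 1
    for factor in range(1, k + 1, 2):
        result *= factor
    return result


def rooted_binary_trees(n):
    """Number of rooted binary trees on n labeled leaves, n >= 2."""
    return odd_double_factorial(2 * n - 3)


def unrooted_binary_trees(n):
    """Number of unrooted binary trees on n labeled leaves, n >= 3."""
    return odd_double_factorial(2 * n - 5)


for n in (10, 11, 12):
    print(n, rooted_binary_trees(n), unrooted_binary_trees(n))
```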

Of course, people had noticed that there were certain things shared by
Italic and Celtic; it had certainly been previously proposed that
Anatolian was an 'aunt' to the other IE languages, etc.; but there was
previously no way of pursuing the question in a systematic and unified
way.  This is the contribution which Ringe, Warnow, and Taylor have made.

> The real question which remains for me (judging only by the recent
> discussions here and the public presentations I have seen) is whether
> the choice of WHICH "characters" to use does not already imply what
> conclusions will be drawn.

For whatever it's worth, Don Ringe was previously on record as not
believing the Indo-Hittite hypothesis (i.e., that Anatolian is an "aunt"
to the other IE languages), but he changed his view as a result of the
output of the algorithm.

> With a deterministic algorithm
> (as opposed to a partly randomizing one), that should of course be the case.

Good luck.  The problem is NP-complete.  If you can work out a
deterministic way of computing a phylogeny from characters in polynomial
time, you will have solved a major problem in mathematics and computer
science.  As it works out, by the same stroke, you will have also
undermined most of the public-key cryptographic methods currently in use,
and certainly assured yourself a prominent place in history.

> So our attention really needs to be directed to those choices.
> Especially since the addition of one "character" changed the family tree
> as it affected the position of Celtic and Italic.

You obviously missed my post where I corrected myself on this point.
There were actually two characters added which were crucial to forcing
Celtic and Italic into a single sub-branch.  Even before those two
additional characters were added, Italic and Celtic were always next to
each other on the tree; and on two of the best runs, an indeterminate
structure was produced for which one resolution was an Italo-Celtic
grouping.  So the tree was already on the verge of an Italo-Celtic
grouping before these additional characters were added.

I should also add something else.  Consider the superlative.  Anatolian
and Tocharian have no superlative; Italic and Celtic have *-issim-, and
all the other IE languages have *-ist- (I'm probably not remembering the
reconstructions just right, but they're along these lines). We might
naively take this as evidence for an Italo-Celtic grouping, as some have
done.  When we look at the overall tree, however, we see that this isn't
so: it could be that the common ancestor of Greek, Armenian, Balto-Slavic,
Germanic, and Indo-Iranian used to have *-issim- and then innovated by
changing it to *-ist-.  However, once we have concluded on independent
grounds that there is an Italo-Celtic grouping, a lot of facts like those
of the distribution of the superlative fit the picture very nicely.
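
The superlative point can be illustrated with a small sketch.  Note the
substitution: this uses Fitch's small-parsimony count, a standard textbook
technique, and not the algorithm the team actually ran.  The tree shapes,
the six-language sample, and the state names are simplified assumptions for
illustration.  The character needs the same minimum number of changes
whether or not Italic and Celtic form a single sub-branch, so by itself it
cannot decide between the two trees.

```python
def fitch(tree, states):
    """Return (candidate_states, changes) for a nested-tuple binary tree."""
    if isinstance(tree, str):  # a leaf is a language name
        return {states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch(left, states)
    right_set, right_cost = fitch(right, states)
    common = left_set & right_set
    if common:  # children agree on some state: no change needed here
        return common, left_cost + right_cost
    return left_set | right_set, left_cost + right_cost + 1


# Superlative coding as described above (state names are arbitrary labels).
superlative = {
    "Anatolian": "none", "Tocharian": "none",
    "Italic": "issim", "Celtic": "issim",
    "Greek": "ist", "Germanic": "ist",
}

# Hypothetical tree with an Italo-Celtic sub-branch.
grouped = ("Anatolian", ("Tocharian",
           (("Italic", "Celtic"), ("Greek", "Germanic"))))

# Hypothetical tree where Italic and Celtic branch off one after the other.
ungrouped = ("Anatolian", ("Tocharian",
             ("Italic", ("Celtic", ("Greek", "Germanic")))))

print(fitch(grouped, superlative)[1], fitch(ungrouped, superlative)[1])
```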

I will say this: you are quite right that the tree produced is a product
of the characters which one selects.  If you feel that the team erred in
the specific characters they included, this would be a good place for you
to write a criticism: you could argue why your choices are better, and show
how they lead to a different result.  However, I reject any suggestion
that the team had a particular solution in mind ahead of time which biased
their selection of characters.

> I am a researcher, son of two researchers,
> and a PhD linguist who in parts of my life
> does historical and comparative linguistics.

If you don't mind me asking, where did you do your doctoral study?

  \/ __ __    _\_     --Sean Crist  (kurisuto at unagi.cis.upenn.edu)
 ---  |  |    \ /     http://www.ling.upenn.edu/~kurisuto/
  _| ,| ,|   -----
  _| ,| ,|    [_]
   |  |  |    [_]


