SV: Indo-Hittite

Fri Feb 4 04:36:08 UTC 2000

At 02:30 PM 2/2/00 +0000, Larry Trask wrote:
>Stanley Friesen writes:

>>  So far the "cladistic linguistics" I have seen has fallen far short of what
>>  biologists do - many of the solutions to statistical issues that biologists
>>  have come up with are not applied.

>But our problems are not identical to those of the biologists, and their
>solutions do not necessarily work for us.

To some extent.  But things like long-branch attraction are *certainly*
potential trouble-spots in linguistics as well as in biology.  And the
UPenn trees were done in a way that is particularly prone to that
particular malady.  So simply ignoring the work of the biologists is not
the way to go.  What must be done is an analysis of the applicability and
utility of each process.  (Also, I am not certain the UPenn people paid
enough attention to the issue of character selection and its potential for
introducing a bias in the results: character selection is ultimately
necessary, but must be done with extreme care to avoid bias.)

>For one thing, the biologists have a lot more material to work with than
>we do.  They have genes, but we don't.  They have fossils, but we mostly
>don't.

These are relatively minor points.  In many cases neither has been
available to biologists either.

>It is, in my view, an error to assume that comparative linguistics is
>isomorphic to biological taxonomy, and that what is true or successful
>in one field must be true or successful in the other.

I am not making *that* assumption.  But what I have read of the papers from
UPenn shows a lack of awareness of even the most basic precautions needed to
make cladistic analyses truly reliable - precautions against potential
problems that are intrinsic to the *method*, and do not depend on the realm
of application.  Long-branch attraction is a mathematical feature of the
basic model, and sampling issues are fundamental to any mathematical
analysis, but are especially important when one is doing statistical
analyses (which is what cladistics is).
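
As an illustration of why long-branch attraction belongs to the method
rather than to the data, here is a minimal simulation sketch.  It is not
taken from the UPenn papers; the four-taxon tree, the two-state characters,
the change probabilities and the function names are all assumptions chosen
for the example.  The true tree is ((A,B),(C,D)), with A and C on long
branches.

```python
import random

def simulate_site(rng, p_long, p_short):
    """Evolve one binary character on the true tree ((A,B),(C,D)).

    p_long / p_short are assumed probabilities of a state change along a
    long or short branch (illustrative values, not estimates from data).
    """
    x = 0                                # node joining A and B
    y = x ^ (rng.random() < p_short)     # short internal branch to node joining C and D
    a = x ^ (rng.random() < p_long)      # A sits on a long branch
    b = x ^ (rng.random() < p_short)
    c = y ^ (rng.random() < p_long)      # C sits on a long branch
    d = y ^ (rng.random() < p_short)
    return a, b, c, d

def count_patterns(n_sites=20000, p_long=0.4, p_short=0.05, seed=1):
    """Count the characters that support each of the three possible groupings."""
    rng = random.Random(seed)
    counts = {"AB|CD": 0, "AC|BD": 0, "AD|BC": 0}
    for _ in range(n_sites):
        a, b, c, d = simulate_site(rng, p_long, p_short)
        if a == b and c == d and a != c:
            counts["AB|CD"] += 1         # supports the true tree
        elif a == c and b == d and a != b:
            counts["AC|BD"] += 1         # groups the two long branches
        elif a == d and b == c and a != b:
            counts["AD|BC"] += 1
    return counts

counts = count_patterns()
print(counts)
# With four taxa, parsimony picks the grouping backed by the most characters.
print("parsimony prefers:", max(counts, key=counts.get))
```

Under these assumed rates, parallel changes on the two long branches
outnumber changes on the short internal branch, so the wrong grouping
AC|BD collects the most support, and adding more characters only makes the
wrong answer look firmer.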

Some while back I posted a *long* article on the weaknesses in the method
described in the one paper I have analyzed in detail.  If it is not
available in archives somewhere, I could send it to you directly.  (I will
not repost it here, unless there is mass demand for it.)

>As for statistical (probabilistic) approaches, some linguists have been
>trying very hard to develop these, but the difficulties are considerable,
>indeed almost refractory, and so far no one has been able to come up with
>a probabilistic approach which can be regarded as generally satisfactory.

The same is true in biology.  In fact I have, on several occasions,
discussed the problem of statistical significance in cladistic analysis on
the dinosaur mailing list.  As yet this has only been solved for gene
sequence analysis.  It is an unsolved problem for character-based analysis.
Even with my training in statistics I have been unable to come up with a
model that can be used to compute significance statistics for comparing
cladograms that differ by only a few steps.
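
To make "differ by only a few steps" concrete, here is a minimal sketch of
how the step count (tree length) of a cladogram is computed by Fitch
parsimony.  The five taxa, the character matrix and both trees are
hypothetical, invented purely for illustration, and the function names are
my own.

```python
def fitch_steps(tree, states):
    """Fitch parsimony for one character.

    tree   : a taxon name (str) or a pair (left, right) of subtrees
    states : dict mapping taxon name -> observed state of this character
    Returns (possible states at this node, minimum number of changes below it).
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_steps = fitch_steps(left, states)
    right_set, right_steps = fitch_steps(right, states)
    common = left_set & right_set
    if common:
        # The subtrees can agree on a state: no extra change is needed here.
        return common, left_steps + right_steps
    # No shared state: one change must occur on a branch below this node.
    return left_set | right_set, left_steps + right_steps + 1

def tree_length(tree, matrix):
    """Total steps over all characters (one column of the matrix each)."""
    taxa = list(matrix)
    n_chars = len(matrix[taxa[0]])
    return sum(
        fitch_steps(tree, {t: matrix[t][i] for t in taxa})[1]
        for i in range(n_chars)
    )

# Hypothetical data: five taxa scored for five binary characters.
matrix = {
    "A": "11000",
    "B": "11100",
    "C": "00110",
    "D": "00011",
    "E": "00001",
}
tree_1 = ((("A", "B"), "C"), ("D", "E"))
tree_2 = (("A", ("B", "C")), ("D", "E"))

print("tree 1:", tree_length(tree_1, matrix), "steps")
print("tree 2:", tree_length(tree_2, matrix), "steps")
```

On this toy matrix the two trees differ by a single step.  The step counts
themselves say nothing about whether a one-step difference is meaningful,
which is exactly the missing significance test.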

I am not complaining about *that* here.  Indeed if *that* were the only
problem I saw in the UPenn trees, I would consider them well established.

My biggest beef is that, due to the way they did the analysis, few, if any,
of the major branches of the tree can be said to be clear of long-branch
attraction, making the basal branching sequence dubious, at best.

This is why, in another post, I said I would have been more confident in
their results if they *had* used Luwian to help test the Indo-Hittite
hypothesis.

--------------
May the peace of God be with you.         sarima at ix.netcom.com