Austronesian tree in _Nature_

Tue Jul 18 15:52:34 UTC 2000

I am grateful to Michael Cysouw for bringing the two articles
in Nature to our attention.  I've just rushed off to the library
and read them, and now I'd like to offer a few comments.

My first reaction to the articles is not as negative as Michael's,
but I agree that there are points of concern.

Gray and Jordan do not claim that their method can supplant
any existing linguistic methods.  Instead, they appear to see
their method as supplementing, or complementing, our existing
methods, and in particular they express hope that their method
might be useful in integrating data from linguistics, genetics,
archaeology and ethnography.  Note in particular that G&J are
not merely interested in constructing a linguistic family tree:
they want to find a linguistic tree which is highly consistent
with the archaeological evidence for the Austronesian expansion.

They do, however, express surprise that historical linguists have
not made more use of best-tree algorithms.  They seem unaware of the
existing work by linguists with this technique, notably (perhaps)
at Pennsylvania and at Cambridge.  They also seem unaware of the
formidable difficulties which linguists have pointed to in getting
best-tree methods to do anything useful.  Like many non-linguists,
they seem to take it for granted that linguistic data are not
significantly different in nature from (say) genetic data, and
that the same techniques can be applied successfully to all these
data.  Well, this has yet to be demonstrated.  Of course, G&J's
paper constitutes an attempt at just such a demonstration, but
we'll see.

The enthusiasm here is expressed by Cann, who writes the following:

        "A systematic tool that could reveal hidden subgroups among
        similar Austronesian languages would be a powerful way of
        analysing Pacific prehistory."

But G&J do not identify any "hidden subgroups", nor do they claim
to have identified any.  More on this below.

Moreover, G&J acknowledge the importance of the Austronesian
work already done by linguists, though perhaps not as explicitly
as they might have.  But the article is written in the dense and
compressed style of a scientific publication, and it is much
more concise than what we are used to seeing in a linguistics journal.

G&J make the following assumptions.

(1) The Austronesian family exists, and its membership is known.

        This, of course, results entirely from linguistic work.

(2) Cognate lexical items exist, and very many have been identified.

        This too results from linguistic work.

(3) There is no doubt about whether two lexical items are cognate
or not.

        This again results from linguistic work, and specifically from
        Bob Blust's data, the sole source of data used by G&J.  And
        the possible difficulties here have already been noted by
        Michael.

(4) Linguistic family trees can be adequately established on the
basis of lexical cognates alone, without reference to phonology,
morphology or syntax.

        G&J use only lexical items as data.  Members of this list
        will have their own views on this matter.  Cann, in her
        comment, asserts flatly that words are similar to genes.
        Members will also have their own views on this assertion,
        not overtly made by G&J, but implicitly assumed by them --
        though G&J do assert that languages are like molecules.

(5) No weighting is necessary.

        As far as I can tell, G&J have assigned no weightings, but have
        treated all cognates equally as evidence.  Members of this list
        will doubtless have their own views on this matter.

(6) The Austronesian family spread out from Taiwan into the Pacific.

        This assumption again derives mainly from linguistic work,
        though G&J tell us that it is confirmed by genetic and
        archaeological data.  The assumption is necessary in order
        to root the tree.  The program, by itself, can only produce
        an unrooted tree exhibiting degrees of divergence, and a root
        must be introduced from outside, as an auxiliary assumption.
        This means, not just that Taiwan is taken as a geographical root,
        but that the Taiwanese languages are taken as a linguistic root,
        divergence from which is the criterion used to set up the tree.
        So, it should be clear that an enormous amount of linguistic
        work had to be done before G&J could even approach the task
        of constructing a tree with their program.  And, as Michael has
        complained, this obvious fact is not at all emphasized by G&J,
        or by Cann.

(7) Though the Austronesian family contains about 1200 languages,
a judicious sample of 77 languages is enough to work out the
family tree in its main lines.

        This sort of assumption is routine in best-tree work, of course.
        G&J chose from Bob Blust's database the 68 languages which were
        represented in the most cognate sets, and then added a further
        nine languages selected to represent those recognized branches
        of Austronesian which were otherwise poorly represented in the
        sample, giving them a final sample of 77 languages.
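
The sampling rule described under (7) can be sketched in a few lines.
This is only an illustration under stated assumptions: the data layout
(one mapping from language to the cognate sets it occurs in, another
from language to branch), the function name select_sample, and the toy
labels are all hypothetical, and the way the extra languages are chosen
here is a guess at one reasonable procedure, not G&J's own.

```python
from collections import Counter


def select_sample(coverage, branch_of, n_top, n_extra):
    """Pick the n_top best-attested languages, then up to n_extra more
    from the branches that are least represented among those picked.

    coverage  -- dict: language -> set of cognate-set IDs it occurs in
    branch_of -- dict: language -> name of its recognized branch
    (Both layouts are hypothetical, chosen for this sketch.)
    """
    # Rank languages by how many cognate sets they appear in
    # (ties broken alphabetically so the result is reproducible).
    ranked = sorted(coverage, key=lambda lang: (-len(coverage[lang]), lang))
    sample = ranked[:n_top]
    per_branch = Counter(branch_of[lang] for lang in sample)

    for _ in range(n_extra):
        remaining = [lang for lang in ranked if lang not in sample]
        if not remaining:
            break
        # Take the best-attested language from the branch that currently
        # has the fewest representatives in the sample.
        pick = min(remaining,
                   key=lambda lang: (per_branch[branch_of[lang]],
                                     -len(coverage[lang]), lang))
        sample.append(pick)
        per_branch[branch_of[pick]] += 1
    return sample


# Invented toy data: three languages in branch A, one each in B and C.
coverage = {
    "a1": {1, 2, 3, 4},
    "a2": {1, 2, 3},
    "a3": {2, 3},
    "b1": {1},
    "c1": {4},
}
branch_of = {"a1": "A", "a2": "A", "a3": "A", "b1": "B", "c1": "C"}

print(select_sample(coverage, branch_of, n_top=2, n_extra=2))
# ['a1', 'a2', 'b1', 'c1']
```

With n_top=68 and n_extra=9 the same call would give a sample of 77.
Note that ranking by coverage alone would have filled the toy sample
entirely from branch A, which is the situation the extra languages
are meant to correct.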

With these assumptions, G&J put their program to work to compute the
best tree.  They obtained a tree which is strikingly similar to the
one constructed by Austronesianists using traditional methods.  There
were, however, a few significant differences: some languages were placed
in different branches from the ones where the Austronesianists put
them.  G&J do not claim that their tree is better.  Rather, they
acknowledge that their program has produced spurious results, probably
because of borrowing.  In their words, Austronesian cultural history
is not "totally tree-like".
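
What "computing the best tree" with unweighted cognates amounts to,
and how borrowing can disturb it, can be shown with a minimal sketch.
It assumes that "best" means most parsimonious, with each cognate set
coded as present (1) or absent (0) in each language, and it scores a
candidate tree with Fitch's counting procedure.  The four language
labels, the two candidate trees, the data, and the weights parameter
are all invented for illustration; this is not G&J's program.

```python
def fitch_steps(tree, states):
    """Minimum number of changes one character needs on a rooted binary tree.

    tree   -- nested 2-tuples with language names (str) at the leaves
    states -- dict: language -> 0/1 (cognate absent/present)
    Returns (possible states at this node, number of changes below it).
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_steps = fitch_steps(tree[0], states)
    right_set, right_steps = fitch_steps(tree[1], states)
    common = left_set & right_set
    if common:
        return common, left_steps + right_steps
    # No shared state: one change is needed on a branch below this node.
    return left_set | right_set, left_steps + right_steps + 1


def tree_length(tree, characters, weights=None):
    """Total (optionally weighted) parsimony length over all characters."""
    if weights is None:
        weights = [1] * len(characters)   # equal weighting
    return sum(w * fitch_steps(tree, ch)[1]
               for w, ch in zip(weights, characters))


tree_1 = (("W", "X"), ("Y", "Z"))
tree_2 = (("W", "Y"), ("X", "Z"))
characters = [
    {"W": 1, "X": 1, "Y": 0, "Z": 0},
    {"W": 0, "X": 0, "Y": 1, "Z": 1},
    {"W": 1, "X": 0, "Y": 1, "Z": 0},  # e.g. a loan between W and Y
]

print(tree_length(tree_1, characters))                     # 4
print(tree_length(tree_2, characters))                     # 5
print(tree_length(tree_1, characters, weights=[1, 1, 3]))  # 8
print(tree_length(tree_2, characters, weights=[1, 1, 3]))  # 7
```

With equal weights the shorter tree groups W with X, and the item
shared by W and Y simply costs an extra step.  If that one item is
weighted three times as heavily, the preferred tree flips.  The sketch
only scores two given trees; a real search must also explore the very
large space of possible trees.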

Then G&J go on to test two models of the Austronesian expansion:
the 'express-train' model and the 'entangled-bank' model.

The express-train model holds that Austronesian spread out rapidly and
unidirectionally from west to east, with rapid branching, with little
borrowing among branches of Austronesian, with little linguistic
intermingling with the languages of earlier inhabitants, and with
few or no east-to-west movements.  The entangled-bank model holds
the contrary: extensive linguistic intermingling between different
Austronesian languages and between Austronesian and non-Austronesian
languages.  The first model predicts a clean and simple family tree;
the second predicts no identifiable family tree at all.

G&J conclude that their results are highly consistent with the
express-train model, but not at all consistent with the entangled-
bank model.  And both G&J and Cann see this potential for testing
proposed models as a great virtue, perhaps *the* great virtue,
of the best-tree approach.

So there is perhaps something here to think about.  However, as
Cann observes in her commentary, the Austronesian case may not be
typical in linguistics.  Rather, it may be an exceptionally simple
case: a single family expanding rapidly into a vast territory
much of which was not yet inhabited, with significant distances
between the habitable locations.  It remains to be seen whether
G&J's method can produce useful results in other cases, especially
in cases which are known to be messy.

Consider Afro-Asiatic.  This is a recognized family, but one at
the very limit of our ability to detect relationships.  The location
of Proto-AA is not known, and so no geographical root can be
identified.  So far, in spite of vigorous efforts, we have no
accepted reconstruction of Proto-AA, and so we cannot use
Proto-AA as a linguistic root, in the way that the Penn team used
PIE to root their IE tree.  Moreover, the evidence that I have seen
in support of the validity of the AA family is mostly morphological,
not lexical.  So, what -- if anything -- could G&J's best-tree
approach tell us about the family tree of AA?
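
To see why a missing root matters, consider a minimal sketch of the
rooting step.  An unrooted tree records only which groups are
connected; the same connections yield different family trees
depending on which language is taken as the outgroup.  The adjacency
representation, the function name root_at_outgroup, and the labels
are hypothetical, chosen only for this illustration.

```python
def root_at_outgroup(adjacency, outgroup):
    """Root an unrooted tree on the branch leading to the leaf `outgroup`.

    adjacency -- dict: node -> list of neighbouring nodes (undirected)
    Returns nested tuples; leaves are strings.
    """
    def subtree(node, parent):
        children = [n for n in adjacency[node] if n != parent]
        if not children:                 # a leaf
            return node
        return tuple(subtree(child, node) for child in children)

    if len(adjacency[outgroup]) != 1:
        raise ValueError("outgroup must be a leaf")
    (neighbour,) = adjacency[outgroup]
    # The root sits on the branch between the outgroup and everything else.
    return (outgroup, subtree(neighbour, outgroup))


# One unrooted tree: leaves T, P, Q, R joined through internal nodes u, v.
unrooted = {
    "T": ["u"], "P": ["u"], "Q": ["v"], "R": ["v"],
    "u": ["T", "P", "v"],
    "v": ["u", "Q", "R"],
}

print(root_at_outgroup(unrooted, "T"))  # ('T', ('P', ('Q', 'R')))
print(root_at_outgroup(unrooted, "Q"))  # ('Q', (('T', 'P'), 'R'))
```

Rooted on T, the tree says that T split off first and that Q and R
form the youngest group; rooted on Q, it says that T and P form a
subgroup instead.  Nothing in the unrooted tree itself decides
between these readings, which is why the choice of root has to come
from outside evidence.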

And what about convergence phenomena?  G&J conclude that their
success in obtaining a good tree shows that borrowing and other
convergence phenomena have not occurred on a large scale within
Austronesian.  Fine.  But what about all those other cases in
which large-scale convergence is known to have occurred?
Presumably G&J's method would simply return no tree in such cases,
but surely the method needs to be tested on a few messy cases,
in order to find out just what it does do.  Until that is done,
I guess we can contain our enthusiasm.

I myself believe that mathematical and computational methods must
eventually prove to be valuable in comparative linguistics.  But
it's interesting that the linguists who engage in such work are
usually far more cautious in their claims than the non-linguists.

Larry Trask
COGS
University of Sussex
Brighton BN1 9QH
UK

larryt at cogs.susx.ac.uk

Tel: 01273-678693 (from UK); +44-1273-678693 (from abroad)
Fax: 01273-671320 (from UK); +44-1273-671320 (from abroad)