Cambridge and Greenberg's methods
ECOLING at aol.com
Wed Aug 25 15:32:38 UTC 1999
I am very grateful to Larry Trask for his discussion of
Joseph Greenberg's methods and Cambridge methods today.
I hope we can have more discussion of the Cambridge methods.
But I will insist (and provide evidence below)
that Trask still does not understand that Greenberg is doing
something COMPLETELY DIFFERENT FROM
claiming to prove particular language families are related.
Barring the difference of whether one expresses one's conclusions as
rooted family trees (Greenberg, I don't remember whether he used
tree diagrams for this) or
unrooted family trees (the Cambridge Group that Trask refers to),
there may be little difference between them
(little difference in matters noted in Trask's discussion today; of course
there may be other differences that we consider important).
>>Ruhlen wishes to embrace the conclusion `All languages
>> are related.'
>> As I have understood Joseph Greenberg's clearer and more cogent
>> statements, his own work actually does NOT propose to prove any such
>> conclusion. It is rather an ASSUMPTION that all languages are or might
>> be related (i.e. we are not to exclude that).
>The assumption that all languages are related is out of order.
>The assumption that all languages *might* be related is hardly an
>assumption at all, and in any case such an idea is excluded by no one.
The assumption is of course quite in order, to see what it might lead to,
as long as it is merely that, an assumption.
So there is really no difference between the two statements just above,
except some emotional flavoring which antagonizes some people.
Because even if all languages ARE ultimately related, that does NOT
imply that standard reconstructive method can establish the relation between
two randomly chosen languages. If Albanian is so difficult merely
within Indo-European, then more distant cases will be much more difficult.
All one needs to understand Greenberg's reasoning is to assume
that all languages *might* be related.
As far as I can tell, the only difference noted by Trask
is in whether one expresses one's tentative conclusions as
(if all languages are related, then here are some
conceivable family trees for the deeper connections), or as
(here are some conceivable deeper connections,
though we do not express them as family trees).
In terms of politics, it would perhaps have been more useful in the long run
if Greenberg had not expressed his conclusions AS IF they were family trees.
That is clearly a red flag to some people. They seem to treat such
trees as if they were final conclusions proven by sufficient evidence,
for two languages or language families taken in isolation from all others,
rather than one investigator's claim that the evidence, as he evaluates it,
suggests those trees rather than some others (merely that:
the conclusion preferable to others, rather than a proven conclusion).
Personally, I have no problem at all treating that as the expression of a
tentative hypothesis only (how could it have ever been anything else?),
and I have been simply amazed at the antagonisms. Shouldn't we always
try to make the best use of the contributions of each one of us?
Didn't Greenberg attempt to systematize a large amount of data which
others can then correct and improve on?
>> Greenberg's method of comparison serves to find the CLOSEST
>> resemblances (merely that, CLOSEST). In the Americas, his method
>> leads him to the conclusion (no surprise) that Eskimo-Aleut is not
>> closely related to any other Amerindian languages, and that
>> Athabaskan / Na-Dene (with outliers) is not closely related to any
>> other Amerindian languages (though conceivably not as distant from
>> them as Eskimo-Aleut?).
>This account of Greenberg's work makes it appear to resemble certain far
>more rigorous work in progress elsewhere, such as at Cambridge
>University. The Cambridge group are working with a variety of
>algorithms which can, in principle, determine degree of closeness, and
>which can hence produce unrooted trees illustrating relative linguistic
>distance. But these algorithms are utterly incapable of distinguishing
>relatedness and unrelatedness. If, for example, you run one of the
>algorithms with a bunch of IE languages plus Basque and Chinese, the
>result is a tree showing Basque and Chinese as the most divergent
>members -- that's all.
>If the same is true of Greenberg's highly informal approach, then G
>cannot distinguish relatedness from unrelatedness,
Exactly. I have always, from the very beginning, understood that
was exactly what Greenberg's approach did. He would classify
Basque and Chinese as the most divergent members of such a set.
He would conclude neither "related" nor "unrelated" if he did not assume
all languages were related. Given the assumption of all languages
potentially related, which since unprovable amounts merely to a
way of expressing one's hypotheses, he concludes "most divergent",
just as he in fact did for the families Athabaskan and Eskimo-Aleut.
The difference in these modes of expression is utterly trivial as long
as we are concerned with tentative hypotheses, not with ultimate truth.
[Barring of course the comments "rigorous" vs. "informal",
which is a completely separate issue, whether algorithms do adequately
capture the best human judgments; please see below.]
>and he has no
>business setting up imaginary "families".
Given that this is merely a way of expressing degree of divergence,
under Greenberg's assumptions, I find it unobjectionable.
I don't draw any more conclusions from it than are warranted,
one investigator's tentative judgements of stronger vs. weaker
resemblances. That is all.
I am not sure what Cambridge's "unrooted trees" are, other than
a graph-theoretic term meaning that the direction of change is unspecified,
because no node is singled out as an "origin".
In addition to that, the use of unrooted trees may also be a way to
acknowledge in part the positions of those who suggest we should
be giving much more consideration to dialect networks, areal phenomena,
etc. etc. than to binary trees. It is a perfectly legitimate position that
we are forced to this at greater time depths, where it is harder to
distinguish borrowings from genetic inheritances (where, at sufficient remove,
borrowings actually become genetic inheritances for most practical purposes).
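To make concrete the point that a distance-based procedure of the Cambridge sort can only rank divergence, and has no output state meaning "unrelated", here is a minimal sketch. The distance matrix below is invented purely for illustration; the numbers are not real lexical data.

```python
# Hypothetical pairwise lexical distances (0 = identical, 1 = no matches).
# These numbers are invented for illustration only.
langs = ["English", "German", "French", "Spanish", "Basque", "Chinese"]
dist = {
    ("English", "German"): 0.25, ("English", "French"): 0.45,
    ("English", "Spanish"): 0.50, ("English", "Basque"): 0.92,
    ("English", "Chinese"): 0.95, ("German", "French"): 0.48,
    ("German", "Spanish"): 0.52, ("German", "Basque"): 0.93,
    ("German", "Chinese"): 0.94, ("French", "Spanish"): 0.20,
    ("French", "Basque"): 0.90, ("French", "Chinese"): 0.96,
    ("Spanish", "Basque"): 0.91, ("Spanish", "Chinese"): 0.95,
    ("Basque", "Chinese"): 0.97,
}

def d(a, b):
    """Symmetric lookup into the distance table."""
    return 0.0 if a == b else dist.get((a, b), dist.get((b, a)))

# Rank each language by its mean distance to all the others: a pure
# divergence ranking.  Basque and Chinese come out "most divergent";
# the procedure says nothing about whether they are related at all.
ranking = sorted(langs,
                 key=lambda a: sum(d(a, b) for b in langs) / (len(langs) - 1))
print(ranking)
```

Whatever numbers we feed in, the output is only an ordering; "related" vs. "unrelated" is not in the algorithm's vocabulary, which is exactly the limitation Trask describes.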
In fact, I think the lack of major cleavages in Greenberg's Amerind,
that is, everything except Athabaskan and Eskimo-Aleut,
is virtually the same conclusion as having an unrooted tree or dialect network
or even, WITHIN that more limited context, not linking the families of
which it is composed at the highest levels.
AND, notice, it could also simply be an expression of an inability of
Greenberg's methods to penetrate deeper, to distinguish at such a depth
between neighbor-influences such as borrowing and genetic inheritance.
Perhaps here there really is so much noise that Greenberg's method of
judgements from data sets cannot yield much. I do not claim to know.
But that is NOT the same as saying I conclude anyone should completely
discount Greenberg's estimates.
>> His actual conclusions are about relative UNRELATEDNESS of language
>> families (notice, not about absolute unrelatedness, which he does
>> not claim his method has the power to evaluate).
>Just as well. It is logically impossible to prove absolute
>unrelatedness, and G would be mad to undertake such a thing.
>> Beyond that, Greenberg's methods do NOT enable him to establish any
>> similar degree of unrelatedness among the remaining languages of the Americas.
>> I hope I have stated that carefully enough, to make obvious that it
>> is a matter of degree, not absolutes, and that Greenberg's method
>> actually demonstrates the points of SEPARATION rather than the
>> points of UNION.
>Fine, but then G's methods do not suffice to set up language families --
>even though that is exactly what he does.
No he does not. Greenberg's language families (family trees)
are an expression precisely of separations just as much as of unions.
The two are equivalent, under the assumption that we cannot know
about absolute truth of language relationship.
I will readily admit that Greenberg should have REPEATED more
often and more clearly that his method assumed ultimate relatedness,
and merely dealt with different degrees of closeness, that his method
did not purport to prove relatedness of two particular language families.
>> Greenberg's method is potentially useful in that it is likely to
>> reveal some deep language family relationships which were not
>> previously suspected,
>I said exactly this on page 389 of my textbook.
I look forward to reading this.
>> AS LONG AS we do not introduce systematic biases
>> which overpower whatever residual similarities still exist
>> despite all of the changes which obscure those deep relationships.
>I'd be interested to know just what `systematic biases' you have in mind.
I gave a paper at a Berkeley meeting once, in which I mentioned the
possibility that a greater specialization by Greenberg in
Andean-Equatorial more than in the other languages of South America
might in some way bias his conclusions to find closer relations between
that family and other families, simply because with more of the
Andean-Equatorial vocabulary running around in his head, he would
be more likely to be struck by similarities of other language families
to something he already knew in Andean-Equatorial, more likely than
to be struck by similarities among language families other than
Andean-Equatorial. I simply don't remember
whether I felt at the time I could draw any conclusions.
>> In other words, mere noise in the data, or dirty data,
>> if the noise or dirt are random, should not be expected to selectively
>> bias our judgements of closeness of resemblance...
>No. I can't agree.
I find the "don't agree" puzzling, since "random" noise,
almost by definition, precludes any systematic bias.
Trask continues to explain this (which I cite later).
There is a way of evaluating this, at least partly.
If we take the same data sets that Greenberg used,
and in that Berkeley paper I did it for South American languages,
and consider different levels of strictness of correspondence
in sound and meaning, we can divide the vocabulary matches
proposed in different sets. Taking only the closest matches,
do we get a different set of closest language relations?
(This might of course be because the closest matches represent
middling-ancient borrowings, not in fact the very oldest layer
of genetic inheritances, even assuming ultimate relatedness.)
As I remember, the results gave only a slightly different
degree of closeness for some particular language families.
That might suggest that very deep genetic relations and middle-ancient
or recent borrowings were not terribly different in the relations
they reflect (genetic vs. neighbors). No big surprise.
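The stratification procedure just described can be sketched as follows. The family names and strictness scores below are invented toy data, not Greenberg's actual South American material: each proposed match carries a score for how strict its sound-and-meaning correspondence is, and we ask whether restricting to the strictest stratum changes the "closest relative" ranking.

```python
# Toy data: proposed matches between Andean and other (invented) partner
# families, each with a strictness score in 0..1 (higher = closer match).
matches = [
    ("Andean", "Equatorial", 0.9), ("Andean", "Equatorial", 0.8),
    ("Andean", "Equatorial", 0.4), ("Andean", "Macro-Ge", 0.7),
    ("Andean", "Macro-Ge", 0.3), ("Andean", "Macro-Panoan", 0.5),
    ("Andean", "Macro-Panoan", 0.2),
]

def closeness_ranking(pairs, threshold):
    """Rank partner families by how many matches meet the strictness threshold."""
    counts = {}
    for a, b, score in pairs:
        if score >= threshold:
            counts[b] = counts.get(b, 0) + 1
    return sorted(counts, key=counts.get, reverse=True)

full   = closeness_ranking(matches, 0.0)  # every proposed match
strict = closeness_ranking(matches, 0.6)  # only the closest matches
print(full, strict)
```

If the strict stratum and the full set yield much the same ordering, as I recall they did in the Berkeley paper, that is at least weak evidence that the deepest layer and the middling-ancient layer reflect similar relations.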
[LT explaining why he believes random noise could affect conclusions]
(But (LA) he is really discussing yes/no conclusions about relatedness
as ultimate truth, which was NOT under consideration above,
instead of RELATIVE degrees of divergence of W from X vs. from Y.)
>Suppose two languages A and B are genuinely but distantly related.
>In this case, it is at least conceivable that false positives (spurious
>matches) would be counterbalanced by false negatives (the overlooking of genuine matches).
>But suppose the two languages are not in fact related at all. In this
>case, false negatives cannot exist, because there is no genuine evidence
>to be overlooked. Hence the only possible errors are false positives:
>spurious evidence. And the great danger is that the accumulation of
>false positives will lead to the positing of spurious relationships.
>Many of G's critics have hammered him precisely on this point.
The point I made about random noise had NOT to do with whether
particular languages are ultimately related, but whether a given language
or family W is more closely related to others X or Y.
In that context, why should "random" (by definition) noise in the data
selectively favor W to X rather than W to Y? No possible reason that
I can imagine.
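A minimal Monte Carlo sketch of this claim, with invented counts: suppose W genuinely shares more resemblances with X than with Y, and random noise both deletes genuine matches and adds spurious ones at the same rate in each comparison. Under that assumption the noise does not selectively favor W-to-Y over W-to-X; the relative ordering survives in the large majority of trials.

```python
import random

random.seed(0)

# Hypothetical true numbers of genuine resemblances (invented).
TRUE_WX, TRUE_WY = 30, 20
NOISE = 0.3  # same noise rate applied to both comparisons

def observe(true_count, noise):
    """Modeling assumption: each genuine match is lost with probability
    `noise`, and an equal expected number of spurious matches is gained."""
    kept = sum(random.random() > noise for _ in range(true_count))
    spurious = sum(random.random() < noise for _ in range(true_count))
    return kept + spurious

trials = 10000
wins_X = sum(observe(TRUE_WX, NOISE) > observe(TRUE_WY, NOISE)
             for _ in range(trials))
print(wins_X / trials)  # fraction of trials preserving the true ordering
```

The point is not the particular fraction, but that nothing in symmetric random noise pushes the comparison toward Y: only noise that treats the two comparisons differently, i.e. systematic bias, could do that.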
Quite a separate question is whether PARTICULAR STRUCTURES
of languages will be handled by our judgements, whether human ones
or algorithmic ones, in different ways such that increasing the amount
of noise in the data selectively affects our ability to make use of
data from those differently structured languages. The obvious example
is languages where most morphemes are CV. But if we define "noise"
carefully, as a percentage of the morpheme's information content,
or in terms of degree-of-deviation along paths of phonetic or semantic
change, then it is not clear that CV-morpheme languages are at a disadvantage
in terms of noise.
[As a tangent, a declaration that CV-morpheme languages cannot be subjected
to comparative linguistics because no morpheme could meet the minimal
CVC criterion for comparison is simply silly. That proves that the minimal
CVC criterion is merely a convenience, an indication of our preferences,
for greater security (of course we prefer greater security)
and not an absolute requirement for comparison.
Yet many comparativists use this criterion as if it were an absolute!
A real language which I have been told is subject to this limitation
of having only or mainly CV morphemes is Yuchi.]
>> [and we can study how we make such judgements to try to strengthen
>> this component of Greenberg's method, to strengthen their robustness
>> against noisy data and our mental failings of judgement]
>In their present form, G's methods appear to me to have no robustness at
>all. Words are similar if Greenberg says they are. And languages are
>related if Greenberg judges that he has found enough similarities
>between them. There are no objective criteria or procedures at all, and
>there is no possibility that anyone else could replicate G's work.
Greenberg's methods are much more robust when applied as he applies
them, to estimating which language families are more closely related,
than they would be if they were applied to try to yield a conclusion
about two language families being absolutely related (vs. not related).
This is almost always misunderstood, and Trask's switch between the
two kinds of questions, in the discussion of the effects of random noise
quoted above, seems to indicate that he has not seen this either.
MOST comparativists are focused on particular languages,
which I think is a reason why they do not understand what Greenberg has done,
or its strengths. They imagine him doing what they regard with some
reason as impressionistic judgements on a particular pair of languages,
to conclude that that pair are related.
IF that were what Greenberg had done, then the criticisms would be
ENTIRELY APPROPRIATE. But that is not what Greenberg did.
Algorithms vs. Human Judgements:
Trask seems to approve of the Cambridge use of algorithms,
and to discount Greenberg's judgements of similarities.
He calls the one "rigorous" and the other "highly informal".
I don't think either is necessarily better than the other.
The assumptions built into each can systematically bias the
results, and such bias will be an increasing problem for BOTH
with increasing time depth and increasing noise in the data.
That is why I have consistently emphasized that we need to explicitly
EVALUATE OUR TOOLS, whether algorithms or human judgements.
The tools have an unfortunate tendency to be taken for granted,
and we forget that all tools have unknown biases.
Here is an illustrative example of where I think judgements can go wrong.
Which is the "closer" pair, in the sense that they are more likely
to descend from a common genetic proto-form
(or perhaps even to be borrowed at some time depth)?
Many comparativists would not see any difference between these.
Yet, because distinctive features may move from segment to segment,
one can argue that the second pair should be treated as
closer. Each of them has labialization in ONE of the first two segments.
As merely ONE of several possible sources for this,
we can easily imagine if we had:
*kwa > ko and *kwa > pa.
On the other hand, of course, it could be that *o > a / p__.
So these things are tough, when attempting estimates of possible
relations at great language depths. Do the two language families
compared both have a distinctive *p distinct from *k? Does either
of them have a pattern of assimilation of vowels to neighboring
consonants? And so on;
much other information can affect either human judgements or algorithms.
We are not involved in judgements of similarity alone, but rather
in judgements of whether a given pair could have a common proto-form.
I do NOT assume that algorithms are inherently superior to human
judgements, at least not yet. Assumptions are built into all algorithms,
and even the designers may not be aware of all the assumptions
they are building in. That is merely normal.