6.10 Comparative method

Thu Jan 12 18:01:16 UTC 1995

----------------------------------------------------------------------
LINGUIST List:  Vol-6-10. Thu 12 Jan 1995. ISSN: 1068-4875. Lines: 330

Subject: 6.10 Comparative method

Moderators: Anthony Rodrigues Aristar: Texas A&M U. <aristar at tam2000.tamu.edu>
            Helen Dry: Eastern Michigan U. <hdry at emunix.emich.edu>

Asst. Editors: Ron Reck <rreck at emunix.emich.edu>
               Ann Dizdar <dizdar at tam2000.tamu.edu>
               Ljuba Veselinova <lveselin at emunix.emich.edu>
               Liz Bodenmiller <eboden at emunix.emich.edu>

-------------------------Directory-------------------------------------

1)
Date: Tue, 03 Jan 1995 09:15:13
From: koontz at alpha.bldr.nist.gov (John E. Koontz)
Subject: Re: 5.1500 Comparative method, Polarization & reviews

2)
Date: Thu, 05 Jan 1995 16:17:04 -0500 (EST)
From: Matthew Dryer (LINDRYER at ubvms.cc.buffalo.edu)
Subject: Genetic Classification

3)
Date: Wed, 11 Jan 1995 14:11:06 +1100 (EST)
From: j.guy at trl.OZ.AU (Jacques Guy)
Subject: Re: 6.06 Greenberg (again... sigh)

-------------------------Messages--------------------------------------
1)
Date: Tue, 03 Jan 1995 09:15:13
From: koontz at alpha.bldr.nist.gov (John E. Koontz)
Subject: Re: 5.1500 Comparative method, Polarization & reviews

I am forwarding the following for posting at the request of Robert Rankin
(rankin at ukanvm.cc.ukans.edu).

I do not subscribe to this list and have no wish to join the fray at
present, but when my name is mentioned sometimes the file is forwarded
to me via e-mail.  Thus the following:

Andy Anderson cites me on three points in a series of recent postings.
I have known Andy upwards of 30 years and do not feel that he would in-
tentionally misrepresent my views, but I also feel a couple of things
need clarification.

First, I am uncomfortable about being formally cited as a (secondary)
source of information on Lyle Campbell's paper from the Boulder (Green-
berg) Conference of ca. 1990.  If Andy wishes to distribute an attack
on the paper or its author in written form, he should first obtain an
actual copy of it or, alternatively, await its publication.  I guess
I shouldn't have brought it up in our conversation at the SSILA/AAA
meetings.

Second, I am said to have reported that the geneticists who have studied
the mitochondrial DNA (mtDNA) of sundry Native American and Siberian peo-
ples claim that there were/are "two subgroups within Amerind (aside from
the Eskimo and Athabaskans)."  This is not what I (or the paper's authors
Wallace, Torroni and Schurr, et al.) said.  The authors did not address
themselves to the linguistic problems and most certainly didn't talk about
"subgroups".  Nor did I, since I do not regard the historicity of anything
like "Amerind" as even remotely established.

The authors of the paper did posit at least four "migrations".  They do not
discuss the most recent, Eskimo-Aleut, in their abstract, but I THINK they
gave a time depth figure of about 6000 years BP (Before Present) for it
orally--don't quote me.  FROM THEIR ABSTRACT:  For what they call Na-Dene
their figure is 7000-10000 years BP.  Then they say they have evidence for
at least two "migrations" preceding that.  One comes between 12000-15000
BP and the earliest between 26000-34000 BP.  Figures as high as 40000 BP
were mentioned orally, as I recall.  They did not attempt to correlate
their figures with our knowledge of periods of glaciation or the periodic
existence of the land bridge in Beringia.

I leave it to readers to decide what this portends for the Amerind hypo-
thesis or its proposed (Glotto)chronology, but a warning is in order in
any event.  Note that I have written "migration" in quotes above.  This
is not because I wish to pejorate the term; it is because geneticists
use it in a very special way.  For them it has to do solely with the ap-
pearance of specific genetic material in American populations.  They
then assume a common ancestor and calculate the number of millennia by
positing a uniform mutation rate for mtDNA.  The material and theories
they work with force this definition of migration on them.  All this
says nothing about the situation "on the ground."  In reality though,
each of these genetic migrations can have included many distinct
movements of people across Beringia over a great many years--perhaps
centuries or even millennia.  And they may have represented many lin-
guistic groups.  All that is required in order for entire clusters of
migrations "on the ground" to get read as a single mtDNA "migration"
is a relatively homogeneous gene pool in Eastern Siberia over the
particular time span when the "genetic mutation" occurred.

The evidence does indeed suggest four GENETIC migrations, but it really
says little or nothing about how many "real" migrations there were with-
in each of the four clusters, nor does it say anything about linguistic
diversity--much less "subgroups of Amerind."  We may wish it did, but
it doesn't.  I do note with interest however the rough correlation
between the geneticists' oldest figures and the calculations of Nichols
(1990 in Language 66.3) based on linguistic diversity in the Western
Hemisphere.  The more recent sets of mtDNA dates fall within the esta-
blished archaeological ballpark for Clovis believers, although the
earliest set certainly does not.

One very short contribution of my own here--mostly my wife's actually,
since she is a molecular geneticist and we talk about these things
over breakfast.  The yardstick used by mtDNA geneticists in these cal-
culations may not be appreciably better than that used in glottochrono-
logy, i.e., genetic mutation takes place at a rate which is only RELA-
TIVELY constant.  It can be speeded up by various singular events from
cosmic ray bombardment to ingesting certain fungi infecting the grain
from your cache pit.  Biologists try to allow for this sort of thing,
but as you can see from the plus/minus dates for each cluster, we are
not talking about something as precise as dendrochronology or even
radiocarbon dating.  The mtDNA studies are very interesting but we must
bear in mind their limitations and special use of the term "migration".
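
The sensitivity of such dates to the assumed rate can be seen in a minimal
sketch.  The divergence and rate values below are hypothetical, chosen only
to make the arithmetic visible; they are not taken from the Wallace, Torroni
and Schurr paper, and the function names are illustrative.

```python
def clock_age(divergence, rate_per_year):
    """Years elapsed, assuming mutations accumulate at a uniform rate."""
    return divergence / rate_per_year


def age_range(divergence, rate_per_year, rel_uncertainty):
    """Youngest and oldest ages if the true rate is off by +/- rel_uncertainty."""
    fast = rate_per_year * (1 + rel_uncertainty)  # faster clock, younger date
    slow = rate_per_year * (1 - rel_uncertainty)  # slower clock, older date
    return clock_age(divergence, fast), clock_age(divergence, slow)


# Hypothetical inputs: 0.0003 substitutions per site accumulated along a
# lineage, at an assumed rate of 1e-8 substitutions per site per year.
point_estimate = clock_age(0.0003, 1e-8)
youngest, oldest = age_range(0.0003, 1e-8, 0.25)
print(round(point_estimate), round(youngest), round(oldest))
```

Under these assumed inputs a rate that is uncertain by a quarter either way
stretches a single point estimate into a span of many millennia, which is
the kind of plus/minus range discussed above.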

Lastly, in an earlier posting Andy mentions that I had examined Green-
berg's notebooks and determined how he had mislabeled so much of his
Siouan data in LIA.  Andy's description of the way the notebooks are
laid out is correct, but I have only actually seen xeroxes of the pages
of Siouan entries, not the notebooks themselves.  I might add that the
Siouan entries in the notebook are hard by the Iroquoian, Caddoan, and Yuchi
entries, demonstrating once again that Greenberg had decided on the final
classification of these families when he laid out his notebook design
and before the vocabularies from the languages were entered.

My thanks to John Koontz for posting this.

                          Sincerely,

Bob Rankin (University of Kansas) (rankin at ukanvm.cc.ukans.edu)

--------------------------------------------------------------------------
2)
Date: Thu, 05 Jan 1995 16:17:04 -0500 (EST)
From: Matthew Dryer (LINDRYER at ubvms.cc.buffalo.edu)
Subject: Genetic Classification

I wish to comment on an issue that recent discussion of Nostratic and the
problem of "demonstrating" distant genetic relationships has skirted
around, and that I believe underlies some of the issues that various
people have been directly addressing.  An assumption that seems to underlie
much of the discussion is that hypotheses regarding genetic relationships
are not interesting unless they can be proven to be true.  I find this a
rather odd assumption, and one that does not seem to be made about any
other kinds of hypotheses in linguistics (or anywhere else in science as
far as I know).  And let us set aside for the sake of argument the
oft-noted point that the notion of proof is not really applicable to
empirical hypotheses, and assume that the term is to be used loosely for
some arbitrary high level of certainty.  It seems fair to say that there is
a fairly widespread disinterest in hypotheses like the Nostratic hypothesis
because it is widely believed (and I will assume it is true here for the
sake of argument) that the available evidence for Nostratic falls short of
this imaginary level of certainty which deserves the label "proven".  A
common type of reaction to unproven hypotheses is that it has not been
demonstrated that the observed similarities are not due to chance
and/or borrowing.

But suppose that someone were to take the same attitude towards comparative
reconstruction of protolanguages.  Suppose that someone were to object to
comparative reconstruction of anything but very shallow groups on the
grounds that one can never prove that the reconstructions are correct.
Just as one can object to certain claims of genetic relationships on the
grounds that one cannot conclusively eliminate the possibility that the
observed similarities might be due to accident and/or borrowing, one could
equally well object to virtually ALL hypotheses surrounding comparative
reconstruction on the grounds that one cannot conclusively eliminate
alternative possibilities.  The comparative method is a way to come up with
the best guess one can make about a protolanguage; it never provides proof
that the reconstruction is in fact correct.  So why bother doing it?

The answer should be obvious: hypotheses which represent our best guesses
at any point in time are what much of science is about.  But why do so many
linguists seem to object to applying the same way of thinking to hypotheses
about genetic relationships?  Why is it that many historical linguists find
hypotheses like the Nostratic hypothesis either laughable or upsetting?
Why don't they react the same way to comparative reconstructions, since
they also are "unproven"?  Why don't they rush out and read everything they
can find on Nostratic and conclude "The evidence is tantalizing but not
conclusive; it's a really exciting hypothesis"?  Why is there such a double
standard?

I want to suggest an answer to this question, an answer which, if right,
provides insight into the nature of many debates surrounding controversial
hypotheses of genetic relationship.  Namely, some people find questions of
genetic classification intrinsically interesting, quite apart from any
detailed historical work that plays a role in supporting hypotheses.  Other
people, however, are primarily interested in the detailed historical work
itself, and do not find questions of genetic classification intrinsically
interesting, but only interesting in so far as they are an inevitable
consequence of historical work.  People of the first sort are more likely
to find recent work reclassifying Penutian languages exciting, while people
of the latter sort are unlikely to react that way, unless they are Penutian
specialists.

As one moves back in time, applying the comparative method becomes
increasingly difficult, and detailed historical work becomes
increasingly speculative (and to many historical linguists, dissatisfying).
But at any time depth, we can always be much more confident of the genetic
classification than we can of any comparative reconstructions.  Our
confidence in Indo-European as a language family is surely greater than our
confidence in ANY specific claims about Proto-Indo-European.  But as we
move further back in time, we should expect there to be hypotheses that we
cannot be entirely confident of, but for which there is at least some
promising evidence, where any comparative reconstruction is going to be
sufficiently speculative as to not be satisfying to linguists interested in
traditional comparative work.  And since these linguists are not interested
in genetic classification except as a byproduct of detailed historical
work, such linguists are likely to find the hypotheses uninteresting.  On
the other hand, for linguists who find questions of genetic classification
inherently interesting, the fact that detailed historical work may not be
possible is irrelevant, and the fact that the hypothesis is unproven or
unprovable may be no more a source of concern than the fact that comparative
reconstructions are always unproven and unprovable.

If this view is correct, much of the debate surrounding controversial
hypotheses in genetic classification is based, not on substantive
questions, but simply on what sorts of questions different people find
interesting.

Matthew Dryer

--------------------------------------------------------------------------
3)
Date: Wed, 11 Jan 1995 14:11:06 +1100 (EST)
From: j.guy at trl.OZ.AU (Jacques Guy)
Subject: Re: 6.06 Greenberg (again... sigh)

) 3)
) Date: Thu, 22 Dec 1994 21:02 -0500 (EST)
) From: Mike_Maxwell at sil.org
) Subject: Evidence against Greenberg?
)
) Perhaps the best evidence against Greenberg's hypothesis would be to show
) that his methods, when applied *in the same way* to randomly chosen samples
) of languages of the Earth (including some Amerindian languages), group them
) in the same way and with the same degree of (un)certainty as those methods
) group Amerindian languages (less the Athabaskan languages) together.  (I
) put the stars around "in the same way" because one can easily distort
) someone else's methods.)  As I understand it, some people have tried
) applying Greenberg's method to one Amerindian language and one other
) language (Finnish was one such, I believe), but I have never heard of a
) large-scale comparison being done in this way.  (And I believe Greenberg
) says his method is best used for mass comparison, not one-on-one.)
)

Here we go again. Some bean counter some day will tot up the number of
times "Greenberg" occurs here and will rate the corresponding work
as "highly influential". Never mind.

There is no difference between mass comparison and pair comparison.
When you engage in mass comparison you carry out a large number
of pair comparisons. The greater the number of comparisons, the more
chances you have of finding cognates... and chance resemblances.
Take two dice and roll them. How often will they show the same
score? Take a bagful of them and empty it onto the floor.
Matches galore.
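
The dice analogy can be made concrete with a minimal sketch.  It assumes
fair six-sided dice; the function names are illustrative and not part of
the original argument.

```python
from math import comb, perm


def p_any_match(n_dice, sides=6):
    """Probability that at least two of n_dice show the same face."""
    if n_dice > sides:
        return 1.0  # more dice than faces: a match is unavoidable
    # Complement of "all faces distinct".
    return 1 - perm(sides, n_dice) / sides ** n_dice


def expected_matching_pairs(n_dice, sides=6):
    """Expected number of pairs of dice showing the same face."""
    return comb(n_dice, 2) / sides


for n in (2, 4, 7, 20):
    print(n, round(p_any_match(n), 3), round(expected_matching_pairs(n), 2))
```

Two dice match one time in six; the number of matching pairs grows with the
number of pairs, that is, roughly with the square of the number of dice.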

But that does not matter.

We've had recently a long, long, exchange on the comparative method,
in which Alexis Manaster Ramer made a point -- which he seemed to
regard as important -- that no language had been found to retain
less than 86% of some sample wordlist (Swadesh's 100? Doesn't matter
as you shall soon see) per thousand years. The claim is false,
but never mind, I'll grant it as true. I'll even grant you 90%
retention. America, they say, was populated 18,000 years ago.
Well, not so, evidence from Brazil now seems to push it back to
50,000 BP. But I'll grant you 18,000 BP. And that everybody since
the Great Crossing was careful not to be linguistically overly
innovative, so that there exist at least two maximally distant
languages which have retained 90% of their vocabulary millennium
in millennium out. Today you could expect to see between them
0.9^(18*2) = 0.0225, i.e. 2.25% words in common. On that famous
100-item highly stable "basic" vocabulary. So that's your
Proto-Amerind reconstituted. Now, of course, we have not taken
chance resemblances into account. If you remember Greenberg's
Sci.Am. article and his calculations, he estimates the
probability of chance resemblances at 1 in 250. But he
forgets that he allows a bit of metathesis. In fact, if
you read Ruhlen's "On the Origin of Languages" carefully, you
will find complete anagramming, since he lists Irish "bligim" as
cognate with his *malk'a. There are six ways in which
you can order 3 consonants, so that is really one
chance of resemblance in 42 (250/6 = a tad under 42). Using
their figure, then, how many chance resemblances should you
expect to find in a 100-item wordlist? 100/42 = 2.38.
Bingo! More than real cognates after 18,000 years with
very conservative languages.
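
The arithmetic above can be checked with a minimal sketch.  It assumes, as
the argument does, that each of two lineages independently retains 90% of
the list per millennium and that any of the six orderings of three
consonants counts as a resemblance; the function names are illustrative.

```python
def shared_fraction(retention, millennia):
    """Fraction of the list still shared by two lineages that each keep
    `retention` of it per millennium, after `millennia` of separation."""
    return retention ** (2 * millennia)


def expected_chance_hits(list_size, base_odds=250, orderings=6):
    """Expected chance resemblances when each of `orderings` arrangements
    of the consonants is accepted, each at 1 in `base_odds`."""
    return list_size * orderings / base_odds


LIST_SIZE = 100
cognates = LIST_SIZE * shared_fraction(0.9, 18)  # about 2.25 items
chance = expected_chance_hits(LIST_SIZE)         # exactly 100 * 6 / 250
print(round(cognates, 2), round(chance, 2))
```

The unrounded chance figure is 2.4 rather than 2.38, because 250/6 is a
little under 42; either way it exceeds the expected number of true cognates.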

Now, *if* America was really populated 50,000 years ago
we should see 0.9^(50*2) = 0.002656% of your 100-item
list preserved. That's one word in 37,649. So out of every
pair of 100-item lists you will find, on the average,
1/37649*100=0.0027 words in common. Meaning that you can
look forward to examining some 376 pairs before you find
one single cognate.

But thanks to mass comparison, you are sure to find it. Only compare
50 seemingly *unrelated* languages (Because you want to pick
maximally distant languages). That gives you 50*(50-1)/2 = 1225
pairwise comparisons. With a bit of luck, that will give 3 or 4 cognates,
each attested by 2 or 3 languages. ... and stacks of spurious
resemblances, each attested by far more languages than your
true cognates.
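
A minimal sketch of the mass-comparison arithmetic, under the same
assumptions as above (90% retention per millennium in each lineage, 50
millennia of separation, a 100-item list, and the unadjusted 1-in-250
chance figure).  The function names are illustrative.

```python
from math import comb


def expected_true_cognates(n_langs, list_size, retention, millennia):
    """Expected true cognates summed over all language pairs."""
    per_pair = list_size * retention ** (2 * millennia)
    return comb(n_langs, 2) * per_pair


def expected_chance_resemblances(n_langs, list_size, odds=250):
    """Expected chance resemblances summed over all language pairs."""
    return comb(n_langs, 2) * list_size / odds


pairs = comb(50, 2)
true_hits = expected_true_cognates(50, 100, 0.9, 50)
chance_hits = expected_chance_resemblances(50, 100)
print(pairs, round(true_hits, 2), round(chance_hits))
```

Under these assumptions the 1225 pairs yield only a handful of true
cognates against several hundred chance resemblances, even before any
allowance for metathesis.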

Perhaps America was not populated 50,000 years ago. But Australia
was at least 40,000 BP. That does not prevent some from
reconstructing Proto-Australian. And trying to link it to
Indo-European.

Enough fun with figures. Why don't you try to *simulate* a paltry
30,000 years worth of evolution of 30 languages each represented
by 100 words, with a one-in-250 (see how generous I am) chance
of resemblances? (Warning: advertisement follows) Download
glotto02.zip from directory /pc/linguistics at garbo.uwasa.fi,
unzip it, read the documentation for the programs GLOTSIM and GLOTTREE,
do it, and see.
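
For readers without the archive, here is a minimal sketch of such a
simulation.  It is not GLOTSIM and makes strong simplifying assumptions
that are not in the original: all 30 languages split from a single
ancestor at the same moment (a star-shaped family), every item is retained
independently at 90% per millennium, and a lost item is never regained.
The function and parameter names are hypothetical.

```python
import random
from itertools import combinations


def simulate(n_langs=30, n_items=100, millennia=30,
             retention=0.9, chance=1 / 250, seed=0):
    """Return (true cognates, chance look-alikes) summed over all pairs."""
    rng = random.Random(seed)
    # retained[l][i] is True if language l kept the ancestral word for
    # item i through every one of the simulated millennia.
    retained = [
        [all(rng.random() < retention for _ in range(millennia))
         for _ in range(n_items)]
        for _ in range(n_langs)
    ]
    true_cognates = 0
    look_alikes = 0
    for a, b in combinations(range(n_langs), 2):
        for i in range(n_items):
            if retained[a][i] and retained[b][i]:
                true_cognates += 1   # both still have the inherited word
            elif rng.random() < chance:
                look_alikes += 1     # unrelated words that happen to match
    return true_cognates, look_alikes


print(simulate())
```

In expectation this setup gives roughly twice as many chance look-alikes
as true cognates; individual runs vary with the seed.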

(Anyway, I might as well be blowing into a violin. It is so much
more fun to go on imagining that one can untangle a past lost
in the mists of time.)

j.guy at trl.oz.au

--------------------------------------------------------------------------
LINGUIST List: Vol-6-10.