6.884, Comparative method

Mon Jun 26 19:50:42 UTC 1995

----------------------------------------------------------------------
LINGUIST List:  Vol-6-884. Mon 26 Jun 1995. ISSN: 1068-4875. Lines: 292

Subject: 6.884, Comparative method

Moderators: Anthony Rodrigues Aristar: Texas A&M U. <aristar at tam2000.tamu.edu>
            Helen Dry: Eastern Michigan U. <hdry at emunix.emich.edu>

Assoc. Editor: Ljuba Veselinova <lveselin at emunix.emich.edu>
Asst. Editors: Ron Reck <rreck at emunix.emich.edu>
               Ann Dizdar <dizdar at tam2000.tamu.edu>
               Annemarie Valdez <avaldez at emunix.emich.edu>

-------------------------Directory-------------------------------------

1)
Date: Thu, 22 Jun 1995 12:38:56 -0500
From: "Paul Purdom" (pwp at cs.indiana.edu)
Subject: Re:  6.825, Comparative Method

2)
Date: Fri, 23 Jun 1995 15:13:17 +1000 (EST)
From: j.guy at trl.OZ.AU (Jacques Guy)
Subject: Comparative method, again.

-------------------------Messages--------------------------------------
1)
Date: Thu, 22 Jun 1995 12:38:56 -0500
From: "Paul Purdom" (pwp at cs.indiana.edu)
Subject: Re:  6.825, Comparative Method

It is interesting to contrast the way most linguists consider comparing
things with the way that biologists do the same thing. (I am a computer
scientist, so I am not a member of either group.) There is currently a lot
of discussion among linguists about the advantages of n-way vs 2-way
comparisons.

Among biologists who build phylogenetic trees this is a dead issue. It is
clear that when you have comparable data on a large number of organisms, it
is better to compare them all at once. (This is for getting the best possible
answer out of a limited amount of data. When limited computer time is the
main issue, then they compare as many organisms at a time as their computer
resources will permit.)

The reason for this is quite simple. The biggest problem in comparing
distantly related organisms is that they do not have much similarity. Often
they have so little that it does not show up in a 2-way comparison. When one
has many organisms to compare, one can use information from closely related
organisms to draw conclusions about their common ancestor. Then the
relationships between the common ancestors (which are more closely related
than the more distant of the current organisms) are easier to tell. The
methods do this all at once. (In explaining what happens it is easier to
describe a step by step process.) Without this added power of multi-way
comparisons we would have no idea whether humans are more like yeast or sweet
potatoes (the answer is probably yeast).

Among linguists there seems to be a strong desire either to decide
definitively whether certain languages are related or else to say that it
cannot be determined. Biologists have many more shades of gray in their view
of the world. They know that their phylogenetic techniques cannot
definitively determine how organisms are related to each other. In some cases
the data shows that the probability for a particular relationship is
1-10^{-6} (i.e., almost certain), whereas in other cases there are three
possibilities and the most likely one has a 40 percent chance of being right.
Since they believe in the basic methods and understand their limitations,
they just analyze the data and let the results speak for themselves.

When I read attacks on n-way analyses in linguistics I apply the arguments to
the phylogenetic case. In most cases the arguments seem to apply without
strain, but they lead to conclusions that are contrary to what is known.

Biologists have many ways of doing n-way comparisons, and they have
done extensive analyses on how the different methods compare on both real and
mathematically generated data. (The interested reader can find many such
articles in Systematic Biology.)

Two important basic methods are parsimony and maximum likelihood. With
parsimony there is a list of characteristics on which the organisms can be
compared (often this will be the first position of the DNA string for a
particular protein, the second position, etc.). For each characteristic, each
organism has a state (such as an A, T, C, or G for the position). One then
implicitly considers every possible tree that connects these organisms
together. For each such tree, one labels the internal nodes with states in a
way that minimizes how many changes of states occur in the tree. The tree
with the minimum number of changes of states is the winner. Trees with close
to the minimum are other trees that also have a large probability of
representing the true state of affairs.

(For those interested in more details, there is a quick algorithm to find
out how many changes there are in the best labeling for a particular tree.
It has a bottom up pass using the idea that if two children both have state A
for a character, then there is a minimum labeling where the parent has state
A, and if one child has state A while the other has state B, then there is a
minimum labeling where the parent has either state A or state B. If one
follows this with a top down pass one can also determine all minimum
labelings for a particular node.)
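
The bottom up pass described above can be sketched in a few lines of Python.
This is only an illustration under stated assumptions: the procedure looks
like what is usually called Fitch's algorithm, the trees are rooted binary
trees written as nested pairs, and the function names, the four taxa A to D,
and the three characters are all made up for the example. The top down pass
is not shown.

```python
def fitch(tree, column):
    """Bottom up pass for one character.

    tree:   a leaf name (str) or a pair (left, right) of subtrees.
    column: dict mapping each leaf name to its observed state.
    Returns (set of states allowed at this node, minimum number of changes).
    """
    if isinstance(tree, str):
        return {column[tree]}, 0
    left_set, left_cost = fitch(tree[0], column)
    right_set, right_cost = fitch(tree[1], column)
    common = left_set & right_set
    if common:
        # The children agree on at least one state: no extra change needed.
        return common, left_cost + right_cost
    # The children disagree: the parent may take either state, at the
    # price of one change.
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_score(tree, characters):
    """Total minimum number of changes over all characters."""
    return sum(fitch(tree, column)[1] for column in characters)


# Hypothetical data: four taxa, three characters.
characters = [
    {"A": "A", "B": "A", "C": "G", "D": "G"},
    {"A": "T", "B": "T", "C": "T", "D": "C"},
    {"A": "C", "B": "C", "C": "A", "D": "A"},
]
trees = {
    "AB|CD": (("A", "B"), ("C", "D")),
    "AC|BD": (("A", "C"), ("B", "D")),
    "AD|BC": (("A", "D"), ("B", "C")),
}
scores = {name: parsimony_score(tree, characters) for name, tree in trees.items()}
print(scores)  # {'AB|CD': 3, 'AC|BD': 5, 'AD|BC': 5}
```

With only four taxa every tree can be listed by hand. For larger data sets
the number of trees grows very quickly, which is why the trees are considered
only implicitly in practice.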

In maximum likelihood one has a mathematical model of how likely transitions
are between different states. For each possible tree one can compute how
likely the observed data is. The tree with the greatest likelihood is the
winner.
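
As a rough sketch of the likelihood computation, the Python below scores a
tree under a deliberately simple model. Everything specific here is an
assumption made for illustration: a symmetric model in which a state changes
along any branch with the same probability `change`, a uniform distribution
of states at the root, and the same hypothetical taxa and characters as might
be used for a parsimony example. The recursion itself is the standard one
usually credited to Felsenstein; real programs also estimate branch lengths
rather than fixing them.

```python
import math

STATES = "ACGT"


def transition_prob(parent, child, change):
    """Symmetric model: keep the state with probability 1 - change,
    otherwise move to each of the other states with equal probability."""
    if parent == child:
        return 1.0 - change
    return change / (len(STATES) - 1)


def conditional(tree, column, change):
    """Return {state: P(observed leaves below this node | node has state)}."""
    if isinstance(tree, str):
        return {s: 1.0 if s == column[tree] else 0.0 for s in STATES}
    result = {s: 1.0 for s in STATES}
    for child in tree:
        below = conditional(child, column, change)
        for s in STATES:
            result[s] *= sum(
                transition_prob(s, t, change) * below[t] for t in STATES
            )
    return result


def log_likelihood(tree, characters, change=0.1):
    """Log-likelihood of the data, assuming independent characters."""
    total = 0.0
    for column in characters:
        at_root = conditional(tree, column, change)
        total += math.log(sum(at_root[s] / len(STATES) for s in STATES))
    return total


# Hypothetical data: four taxa, three characters.
characters = [
    {"A": "A", "B": "A", "C": "G", "D": "G"},
    {"A": "T", "B": "T", "C": "T", "D": "C"},
    {"A": "C", "B": "C", "C": "A", "D": "A"},
]
trees = {
    "AB|CD": (("A", "B"), ("C", "D")),
    "AC|BD": (("A", "C"), ("B", "D")),
    "AD|BC": (("A", "D"), ("B", "C")),
}
likelihoods = {name: log_likelihood(tree, characters) for name, tree in trees.items()}
best = max(likelihoods, key=likelihoods.get)
print(best)  # AB|CD
```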

When naive application is made of these techniques, some care is needed with
respect to how fast different characteristics change. If the character
changes too rapidly, it just contributes noise. If it changes too slowly it is
unlikely to have any changes. In more sophisticated applications of the
techniques, one of the parameters is how fast the character changes. With
these methods, all of the data is useful. The fast changing characters help
resolve details among closely related organisms while the slowly changing
characters help determine how distantly related groups relate to each other.
The biologists are well aware of the fact that the rate of change of the
character needs to be well matched to the time since the split between the
organisms if they are to obtain a lot of information from just a few
characters.

For either of these methods one can determine how likely alternate answers
are by using resampling techniques on the data and redoing the analysis.
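
One common resampling scheme of this kind is the bootstrap: draw characters
with replacement, redo the analysis, and record how often each tree comes out
best. The sketch below uses parsimony as the analysis. The data, the tree
names, the number of replicates, and the choice to split credit equally among
tied trees are all assumptions made for the example, not anything prescribed
above.

```python
import random


def min_changes(tree, column):
    """Bottom up pass: (allowed states, minimum changes) for one character."""
    if isinstance(tree, str):
        return {column[tree]}, 0
    left, left_cost = min_changes(tree[0], column)
    right, right_cost = min_changes(tree[1], column)
    if left & right:
        return left & right, left_cost + right_cost
    return left | right, left_cost + right_cost + 1


def bootstrap_support(trees, characters, replicates=1000, seed=0):
    """Fraction of resampled data sets in which each tree is most parsimonious."""
    rng = random.Random(seed)
    support = {name: 0.0 for name in trees}
    for _ in range(replicates):
        # Resample the characters with replacement, keeping the same number.
        sample = [rng.choice(characters) for _ in characters]
        scores = {
            name: sum(min_changes(tree, column)[1] for column in sample)
            for name, tree in trees.items()
        }
        best = min(scores.values())
        winners = [name for name, score in scores.items() if score == best]
        # Tied trees share the credit for this replicate equally.
        for name in winners:
            support[name] += 1.0 / (len(winners) * replicates)
    return support


# Hypothetical data: three characters favour AB|CD, one favours AC|BD,
# and one does not distinguish between the trees.
characters = [
    {"A": "A", "B": "A", "C": "G", "D": "G"},
    {"A": "C", "B": "C", "C": "T", "D": "T"},
    {"A": "G", "B": "G", "C": "A", "D": "A"},
    {"A": "T", "B": "C", "C": "T", "D": "C"},
    {"A": "T", "B": "T", "C": "T", "D": "C"},
]
trees = {
    "AB|CD": (("A", "B"), ("C", "D")),
    "AC|BD": (("A", "C"), ("B", "D")),
    "AD|BC": (("A", "D"), ("B", "C")),
}
support = bootstrap_support(trees, characters)
print(support)
```

The resulting fractions are the "shades of gray": a tree that wins in nearly
every replicate is well supported by the data, while one that wins in 40
percent of them is merely the best available guess.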

I believe it would be extremely useful for someone to apply these techniques
to the question of the relationship between languages. I would do this myself
except that it is clear that I do not know enough about the data that
linguists use and how to best measure similarity. Perhaps the data where
these techniques could be applied most easily is word list data, where one
has lists of comparable words in various languages.

I do not wish to speak against people who have made detailed criticisms of word
list studies, but I would like to suggest that when someone believes that
ideas have not been properly compared, they should suggest the proper
comparison and redo the analysis. (Of course, that will be a practical
suggestion only after someone produces a program to do the analysis from
appropriately presented word lists.)

--------------------------------------------------------------------------
2)
Date: Fri, 23 Jun 1995 15:13:17 +1000 (EST)
From: j.guy at trl.OZ.AU (Jacques Guy)
Subject: Comparative method, again.

From alderson at netcom.com (Richard M. Alderson III)
---------------------------begin quote------------------------------
If we are looking at a grouping of languages of which we are
uncertain of relationships, and the number of potential n-way cognates is as
low as random chance would dictate, then the likelihood is against their
being closely-enough related to pursue reconstruction.
---------------------------end quote--------------------------------

Exactly, you are spot on!

From AVVOVIN at MIAMIU.ACS.MUOHIO.EDU

--------------------------begin quote-------------------------------
The parallels based on regular
correspondences are not CHANCE or RANDOM parallels and therefore the
proposed statistical games do not apply to the case.
--------------------------end quote---------------------------------

Firstly, "games" is out of place here. Calling a spade a pitch fork does
not help. Therefore call those "games" what they are: "tests". Or better
still, call those "statistical games" what you probably refer to:
"Monte-Carlo simulations". All too often I have heard this line "I do
not understand statistics, I know no mathematics, therefore this does
not concern me".

Second, you misunderstand the meaning of "chance" and "random". The
outcome of the roll of a die, or the toss of a coin, is not chance, nor
random. It is determined by a host of physical factors. Its outcome,
however, is not predictable, precisely because so many factors are
involved that we do not know how to analyze them. In the same manner,
you cannot predict what the phonology, grammar, and lexicon of the
English of here or there will be one thousand years from now. Like the
throw of a die, the future state of a language is unpredictable, hence
"random", but not random. Taking the analogy further, the six possible
outcomes of the roll of a die can each issue from many starting states.
Likewise my earlier example of Lehali /E/ *regularly* derivable from
*ikan, but also *regularly* derivable from a host of other putative
protoforms. That is where uncertainty comes in, despite regular
correspondences, and where resorting to statistical methods is not
only justified but necessary.

--------------------------begin quote---------------------------------
If you do not
believe it, take any pair of a priori unrelated languages, such as for
example Mandarin and Eskimo, and try to establish regular phonetic
correspondences
--------------------------end quote-----------------------------------

I will do so, as I have just done on sci.archaeology.mesoamerican on
occasion of a crackpot post on how Quechua and Hebrew are related.
Take French and Japanese:

J. abura 'oil, grease, fat'    F. boer   'butter'
J. furo  'hot bath'            F. fur    'oven'
J. kuni  'country'             F. kwE~   'corner, spot, place'
                               (as in: un coin tranquille)
J. aruk- 'to walk'             F. ark-   'to walk'

I have more, but I will stop here.
The above evidence these *regular* sound correspondences:

Japanese                      French
f                             f
r                             r
k                             k
b                             b

As is often done, vowels, because too unstable, are ignored. Yet note
that we have the beginning of regular correspondences in vowels. All I
need to do now is grab a Japanese dictionary and look for more chance
coincidences, reporting only those which fit my bogus "regular" sound
correspondences.
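
A toy Monte-Carlo simulation shows how quickly such coincidences pile up once
one is free to search. It is only a sketch, and every parameter is invented
for the illustration: each word is reduced to a skeleton of two consonants
drawn uniformly from a small inventory, and `latitude` is the number of
loosely synonymous words one allows oneself to try in the other language for
each meaning. Real phonologies and real semantic latitude are of course not
uniform.

```python
import random


def chance_lookalikes(n_meanings, n_consonants, latitude, seed=0):
    """Count meanings for which some candidate word in the second language
    has the same two-consonant skeleton as the word in the first language."""
    rng = random.Random(seed)

    def skeleton():
        # A word reduced to two consonants, vowels ignored.
        return (rng.randrange(n_consonants), rng.randrange(n_consonants))

    hits = 0
    for _ in range(n_meanings):
        source = skeleton()
        candidates = [skeleton() for _ in range(latitude)]
        if source in candidates:
            hits += 1
    return hits


# Hypothetical settings: 2000 dictionary entries, 10 consonants.
strict = chance_lookalikes(2000, 10, latitude=1)   # exact meaning only
loose = chance_lookalikes(2000, 10, latitude=10)   # ten near-synonyms allowed
print(strict, loose)
```

Under these assumptions the chance of a hit per meaning is 1/100 with one
candidate and about 1 - 0.99^10, or roughly 1 in 10, with ten candidates, so
a patient search through a dictionary is bound to turn up a respectable list.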

From Alexis Manaster Ramer (amr at CS.Wayne.EDU)
---------------------begin quote----------------------------
(1) Janhunen says that the probability of a match occurring
purely by chance when you compare Japanese with four languages
is four times what it is when you compare it with one language.
This simply cannot be true because probabilities are values
between 0 and 1.  If the probability in the case of a binary
comparison was say .5, then he would be predicting that it
would be 2 in the case of n-ary comparison, which is impossible,
because 2 is not between 0 and 1.
---------------------end quote------------------------------

I do not know what Janhunen says, because never during this
discussion has there been a direct quote of what he said.
What he may have said is that, comparing one language with four
unrelated languages, you will find four times as many false
cognates as if you had compared it with only one unrelated
language, and that is true.
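
The two readings can be set side by side in a short calculation. It is a
sketch that assumes the comparisons with the different languages are
independent and that a single probability `p` of a chance match per word
applies throughout; the value 0.5 is taken from the quoted example, and the
function names and the 100-word list length are chosen for illustration.

```python
def expected_false_cognates(p, list_length, n_languages):
    """Expected NUMBER of chance matches: grows linearly with the number
    of languages compared, and may well exceed 1."""
    return p * list_length * n_languages


def prob_at_least_one(p, n_languages):
    """PROBABILITY that a given word matches by chance in at least one
    of the languages: always between 0 and 1."""
    return 1 - (1 - p) ** n_languages


print(expected_false_cognates(0.5, 100, 1))  # 50.0
print(expected_false_cognates(0.5, 100, 4))  # 200.0
print(prob_at_least_one(0.5, 1))             # 0.5
print(prob_at_least_one(0.5, 4))             # 0.9375
```

The expected count is four times larger with four languages, which is a
statement about counts and not about probabilities. The probability of at
least one chance match rises as well, but it never exceeds 1.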

---------------------begin quote-----------------------------
(2) The other fallacy is not purely mathematical, although I
suspect that it involves elements of confusion.  In any case,
no one who argues for n-ary comparison EVER talks about getting
a match in 2 out of n languages.  Now, if we look at Guy's
numbers, in his scenario of a 100-word list with no shifted
meanings, he came up with 14.5 probable spurious matches
in a binary comparison but only 5.8 when you are looking for
a match between 3 out of 5 languages, 0.13 when you look for
one between 4 out of 5, and he does not give the much smaller
number yet in the case of 5 out of 5.
---------------------end quote------------------------------

When I do not give numbers, it is because there are, exactly and
precisely, zero. I suggest that, instead of discussing the reported
results of my simulations, you carry out similar experiments and see for
yourself. Just ftp change00.zip in directory pc/linguistics at
garbo.uwasa.fi

---------------------begin quote----------------------------
I am not sure how Jacques defines spurious and so I have not
verified the numbers, but they are certainly on the right
orders of magnitude.
---------------------end quote-----------------------------

Once again, read my article in the latest issue of Anthropos. Anthropos
is not, I should think, such an arcane journal that your library does
not have it.

---------------------begin quote----------------------------
Thus, in Guy's
example a match between n - 2 languages out of 5 was
less likely to occur by chance than one between 2 out of 2.
But if n were 100, i.e., you were comparing 100 languages,
then you would not need n - 2 (i.e., 98) languages to agree
to be able to do better than with a binary comparison.  It
would be many many fewer (although I don't know how many
since I do not know what formula Jacques is using and what
he is assuming about the initial probability of a match).
---------------------end quote-----------------------------

For the second time around, the full description of the simulation
algorithm, and the discussion of how it was arrived at, has been
published in a widely available journal. The algorithm itself is freely
available to anyone with ftp access.

This reminds me of the conversation I had with Isidore Dyen in 1981 at
the Third International Conference on Austronesian Linguistics. I had
presented a paper where, amongst other things, I mentioned Dyen's
article on the estimation of the individual retention rates of lexical
items, a paper co-authored with two other people by the names of James and
Cole. Dyen said to me: "Do not talk to me of statistics. I know nothing
about statistics, I understand nothing of your mathematics. It does not
concern me. Talk to those other people [James and Cole]. One is a
programmer, the other a statistician. Don't ask me about that, ask
*them*".

--------------------------------------------------------------------------
LINGUIST List: Vol-6-884.