6.506 Comparative method

Wed Apr 5 14:02:38 UTC 1995

----------------------------------------------------------------------
LINGUIST List:  Vol-6-506. Wed 05 Apr 1995. ISSN: 1068-4875. Lines: 279

Subject: 6.506 Comparative method

Moderators: Anthony Rodrigues Aristar: Texas A&M U. <aristar at tam2000.tamu.edu>
            Helen Dry: Eastern Michigan U. <hdry at emunix.emich.edu>

Asst. Editors: Ron Reck <rreck at emunix.emich.edu>
               Ann Dizdar <dizdar at tam2000.tamu.edu>
               Ljuba Veselinova <lveselin at emunix.emich.edu>
               Annemarie Valdez <avaldez at emunix.emich.edu>

-------------------------Directory-------------------------------------

1)
Date: Mon, 27 Mar 1995 14:06:24 +1000 (EST)
From: j.guy at trl.OZ.AU (Jacques Guy)
Subject: Comparative method: answers to Lloyd Anderson's questions

-------------------------Messages--------------------------------------

I am going to take the time to answer Lloyd Anderson's questions 5 to 8,
stuff the design and implementation of a "database of corporate icons
and acronyms for the use of systems developers, to ensure the same
look-and-feel across the software used throughout Telecom" (whew!). I
wrote the programmer's job specs in August last year, expected one to be
appointed in September, then November, then Xmas came, and it's only
last fortnight that it went through and they faxed me six CV's to choose
from. A couple of hours, nay, a couple of *weeks* more or less, who
cares?

Question:

5.  Guy says that Hartigan's method was accompanied by a word list in some
number of languages, supplied by Dyen.  These are presumably real languages,
because Guy then says he (not Hartigan?)
   "applied it on language
   families computer-generated under the strict condition of a constant
   universal rate of lexical change."
I do not know what effect the assumption of constant universal rate of
lexical change might make on the mutation algorithm, but it is of course
claimed by most participants in these discussions to be a contrary-to-fact
assumption, that is, it is claimed that lexical items with some semantics
change more rapidly than those with other semantics.  I can certainly imagine
that artificially generated data may have fewer quirks that a computer
program could use to detect historical splits.  Whether this is the case is
one of the things I was trying to ask Guy about (perhaps my question was not
clear), and perhaps he has this additional information.

Answer: "Qui peut le plus peut le moins" (who can do more can do less). The monograph from which I
extracted those quotes ("Experimental glottochronology etc.") intended
to examine a number of approaches to reconstruction under the then
generally accepted hypothesis of a universal rate of lexical retention.
Did I write "generally accepted"? With hindsight, perhaps not. It was
the credo in the linguistics department of the Research School of
Pacific Studies then, and I had only started on glottochronology,
lexicostatistics if you prefer, in ... wait... I think it was 1975. Yes.
Tryon's "New Hebrides Languages: An Internal Classification" bears the
date 1976, I wrote the computer program to produce the 178x178 table of
cognate percentages. The lexicostatistical classification of the
languages of Espiritu Santo, most of which I had provided the sample
wordlists for, went against my gut feeling. That must be when I started
looking at what lay behind lexicostatistics. That led me to numerical
taxonomy, starting with Sneath and Sokal's minimum-spanning trees, the
reduced-mutation algorithm, and two inventions of mine: the wildcard
algorithm and the n-way splitting algorithm. "Experimental
glottochronology" is just that: a series of experiments on one single
language family repeatedly "grown in vitro", as it were, under different
sets of circumstances (long sample list, short sample list, varying time
depths), but always under the assumption of a constant retention rate
for all languages and mesolanguages at all times. The monograph consists
of just 38 pages of text describing the experiments and commenting on the
results, presented in an appendix of 140 pages of dendrograms and
tables. I was trying to force one particular comparative method into an
experimental sciences mould ("mold" in US English). Since the
experiments were carried out under the fundamental assumption of the
lexicostatistical method, since borrowing was disallowed, since the
range of the random-number generator was set to 1..5000 (to simulate a 1
in 5000 chance of erroneously counting as cognate what was not), the
risks of chance resemblances were negligible. In a nutshell, the
experiments were run in ideal conditions, in full compliance with all
the requirements of the lexicostatistical method. As I opened with,
"qui peut le plus peut le moins", and the failure of the
reduced-mutation algorithm under ideal experimental conditions means
that it would also have failed under less than ideal conditions.

Question:

6.  I will have to actually read the article Guy refers to, I guess, namely
             (Experimental glottochronology: basic methods and results.
              Pacific Linguistics, Canberra, 1980. p.19)
because the account Guy extracted from it, namely
   "The program was fed the wordlists of the simulated language family,
   and a phylogenetic tree ([26]) drawn from the account of the
   successive mergings of lists and of the predicted past individual
   word replacements."
sounds if taken literally as if the results that the algorithm was supposed
to achieve were fed to it in advance.  Obviously, my literal interpretation
is not a viable one, so perhaps clarification is possible?

Answer: There were two algorithms, rather, two sets of algorithms. One
repeatedly generated the language family. It was the great-granddad of
GLOTSIM in package glotto02.zip at garbo.uwasa.fi in pc/linguistics. The
other, or rather, the others, took the results as input (diachronic
wordlists), and reconstructed dendrograms using diverse methods, some out
of the literature of the time, some of my own invention. To make things
perfectly clear, here are the very commands used to create the language
family of the preliminary experiment, copied from p.46 of "Experimental
glottochronology" (comments in brackets):

CREATE: AZ
[in year 0 create a language AZ]

SPLIT: AZ AD ES
[and let it split forthwith into AD and ES]

TIME: 500 SPLIT: ES EH IN OS
[in year 500 let ES split into EH, IN and OS. Note the early ternary
split of the family. You have the gist of it now, I will dispense with
most of the comments from now on]

TIME: 700 SPLIT: IN IK LN
TIME: 800 SPLIT: LN LIMA MN
TIME: 850 SPLIT: IK IJ KILO
TIME: 1000 SPLIT: IJ INDIA JULIET SPLIT: OS OR SIERRA
TIME: 1100 SPLIT: EH ECHO FOXTROT GH [note the ternary split here]
TIME: 1300 SPLIT: MN MIKE NOVEMBER SPLIT: OR OQ ROMEO
TIME: 1500 SPLIT: AD ALPHA BRAVO CD [a ternary split again]
SPLIT: GH GOLF HOTEL SPLIT: OQ OSCAR PQ
TIME: 1700 SPLIT: PQ PAPA QUEBEC
TIME: 1800 SPLIT: CD CHARLIE DELTA
TIME: 1900 REPORT: TEST
[save all extant wordlists to a file called "TEST"]

The simulation program kept a running account of all the lexical
innovations (example of one such account p.48). The exact histories of
the languages generated were thus known in perfect detail.
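
To make the procedure concrete, here is a minimal Python sketch of such an
"in vitro" run. It is neither the original program nor GLOTSIM: the names
`evolve`, `grow` and `SCRIPT` are hypothetical, the retention rate is assumed
to be per millennium, and every innovation receives a fresh integer, so chance
resemblances (the 1-in-5000 draw mentioned earlier) are left out altogether.
Only the first three splits of the command list are reproduced, and no running
account of innovations is kept.

```python
import random

def evolve(words, years, retention, rng, next_id):
    """Age one vocabulary by `years`; each item survives with probability
    retention ** (years / 1000) (retention assumed per millennium)."""
    keep = retention ** (years / 1000.0)
    aged = []
    for w in words:
        if rng.random() < keep:
            aged.append(w)              # item retained
        else:
            aged.append(next_id[0])     # innovation: a brand-new integer
            next_id[0] += 1
    return aged

def grow(script, report_time, n_items=40, retention=0.8, seed=0):
    """Run (time, parent, children) splits, then report all extant lists."""
    rng = random.Random(seed)
    next_id = [n_items]                 # first integer not used by the proto-list
    root = script[0][1]
    extant = {root: (list(range(n_items)), 0)}   # name -> (words, last update)
    for time, parent, children in script:
        words, since = extant.pop(parent)
        words = evolve(words, time - since, retention, rng, next_id)
        for child in children:
            extant[child] = (list(words), time)
    return {name: evolve(words, report_time - since, retention, rng, next_id)
            for name, (words, since) in extant.items()}

# First three commands of the preliminary experiment (hypothetical encoding).
SCRIPT = [
    (0, "AZ", ["AD", "ES"]),
    (500, "ES", ["EH", "IN", "OS"]),
    (700, "IN", ["IK", "LN"]),
]

family = grow(SCRIPT, report_time=1900)
```

The reconstruction step would then see only `family`, the lists extant at
report time, and nothing of the script that produced them.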

The reconstruction programs were given, as input, _only_ the
"vocabularies" (not lists of words properly, but of integers symbolizing
words) of the languages extant at the time when the REPORT command was
executed. They did not produce trees as output, but accounts of every
step of the reconstruction. Here is an example (from p.56):

Maximum percentage method.
Retention rate: 0.8000. Wordlists: 40 items.
Level of confidence: 0.95000 (1.96039 standard deviations).
/CHARLIE
/DELTA
/BRAVO
 1.00000   0 years
/ALPHA
/CHARLIE - DELTA - BRAVO
 0.92500  175 years

I will spare you the rest. Working back from today, the algorithm (here
it was the minimum-spanning tree method) finds that CHARLIE, DELTA and
BRAVO share 100% cognates, groups them together, and dates the split at
0 BP. Next, it finds that ALPHA has 92.5% in common with the CHARLIE -
DELTA - BRAVO group and dates the split of ALPHA accordingly (using
the known retention rate, here 80%, constant at all times and for all
languages throughout each experiment). The first three lines summarize
the parameters and options used for the simulation and the
reconstruction. "Maximum percentage" means that you take as the distance
of two groups the distance of the closest languages across the groups. I
also tried the minimum-percentage and mean-percentage methods. They are
all methods which had been advocated by practitioners of
lexicostatistics (for linguistics) and numerical taxonomy (for other
disciplines). The level of confidence sets the amount of "fuzziness"
for deciding whether or not two successive splits occurring in short
succession should be reinterpreted as a single split. Again, that was
what people who resorted to lexicostatistical classification used to
do.
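
The dating step and the "maximum percentage" rule can be sketched as follows.
This is an illustration under assumptions, not the original code: it uses the
standard glottochronological formula t = ln(c) / (2 ln r), takes the retention
rate r to be per millennium, and all function names are hypothetical. With
r = 0.8 and c = 0.925 the formula gives about 175 years, which agrees with the
listing above.

```python
import math

def shared_fraction(a, b):
    """Fraction of list positions at which two vocabularies hold the same integer."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def split_age(shared, retention, unit_years=1000):
    """Years since separation, t = ln(c) / (2 ln r), r being per `unit_years`."""
    return unit_years * math.log(shared) / (2 * math.log(retention))

def max_percentage(group_a, group_b, lists):
    """Similarity of two groups = score of their closest pair of languages."""
    return max(shared_fraction(lists[x], lists[y])
               for x in group_a for y in group_b)

# 37 shared items out of 40 is the 0.92500 of the listing.
print(round(split_age(37 / 40, 0.8)))
```

Swapping `max` for `min` or for a mean in `max_percentage` would give the
minimum-percentage and mean-percentage variants.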

Thus, you have on one hand the true history of the language family,
known to an iota; on the other hand the detailed reconstruction by
this or that method. You compare the two, and decide which method
does best, on whichever criteria you think most important.

Question:

7.  Proceeding to what the algorithm did:
    "The reduced mutation algorithm identified the basic binary split in
   all experiments, but did not succeed, even once, in reconstructing
   the subsequent ternary split of ECHO-SIERRA, either as such, or as
   two successive binary splits."
Does this mean that the reduced-mutation algorithm has built into it a
preference for binary splits only, never ternary splits?

Answer: Yes. You will have just seen, in my answer to question 6, how
it was then common to adjust the degree of fuzziness. With nil tolerance
you would always get binary splits, unless, by some extraordinary chance,
you got the same date of split to the last decimal (that can happen with
very short sample lists). The reduced-mutation algorithm as described in
Hartigan did not allow for any fuzziness, so it had to produce mostly
binary splits.
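
The effect of the tolerance can be illustrated with a small Python sketch. It
is only a guess at the general idea: the function name `collapse_splits` is
hypothetical, and the tolerance is given directly in years, whereas the
programs described here derived it from the level of confidence.

```python
def collapse_splits(ages, tolerance):
    """Group successive split ages (in years) whose gap is at most `tolerance`.

    A group holding two ages stands for two binary splits reinterpreted
    as a single ternary split.
    """
    groups = []
    for age in sorted(ages):
        if groups and age - groups[-1][-1] <= tolerance:
            groups[-1].append(age)      # too close to tell apart: same split
        else:
            groups.append([age])        # a distinct, later split
    return groups
```

With a tolerance of 0, only ages that coincide exactly fall into the same
group, which is why a method without fuzziness yields binary splits almost
everywhere.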

Question:
  Reference to "the ... ternary split of ECHO-SIERRA" seems to
imply that a ternary split was deliberately built into the data sets?  Perhaps
again my lack of understanding of the tree "fed" to the program is at issue.

Answer: Yes, it was deliberately built in. I generated a preliminary
language family (I have reproduced the commands for its generation
above). I studied the performance of the diverse methods I intended to
test on this family. I decided that a thing to test was the ability of
those methods to identify multiple splits as such. I decided to go
for ternary splits only, not 4-way splits or worse.

Question:

8.  And the following I cannot understand at all, since it seems to
contradict what was said earlier:
    "The reasons for the resounding failure of the reduced mutation
   algorithm are somewhat akin to those for the failure of the
   traditional lexicostatistical method: the measure of the similarity or
   of the distance between two languages is based on data from just two
   wordlists."
I thought the data was multiple sets of word lists generated by random
mutation?  Otherwise, there is no work for the mutation algorithm to do, if
there are only two lists.  No matter what the data, it would in that case
simply posit one ancestor with the two attested descendants.

Answer: All methods used in "Experimental glottochronology", except one,
rely on the distance (or closeness) between the two members of every
language pair being considered for merging. This is what I mean by "the
measure of the similarity or of the distance between two languages is
based on data from just two wordlists". I know your next question: "On
what else can it be based???" (There ought to be an exasperation-cum-incredulity
mark; a triple question mark doesn't quite do the job.)

The one exception was a method of my own invention. It used the
linear-correlation coefficient of the cognate percentages of the members
of each possible pair with all the rest as a measure of the similarity
of the two members of each possible pair. In other words, my metric was
derived from the scores of each member of a pair with each language
external to the pair. So, with a family of 20 languages for instance, I
calculated the similarity of any two languages from 36 cognate scores,
instead of using the percentage of cognates shared by the two languages
in question. In fact, my metric _completely ignored_ the amount of
cognates they had in common.
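
A minimal Python sketch of such a metric follows, assuming a symmetric table
`pct` of cognate percentages indexed by language number. The function names
are hypothetical and the code is an illustration of the idea, not the program
used in the monograph.

```python
import math

def pearson(xs, ys):
    """Plain linear-correlation coefficient of two equal-length sequences.
    Fails with a division by zero if either sequence is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def external_similarity(i, j, pct):
    """Similarity of languages i and j from their scores with everyone else.
    pct[i][j] itself is never read."""
    others = [k for k in range(len(pct)) if k not in (i, j)]
    return pearson([pct[i][k] for k in others],
                   [pct[j][k] for k in others])

def scores_used(n_units):
    """Number of cognate scores behind one pairwise similarity."""
    return 2 * (n_units - 2)
```

For a family of 20 languages `scores_used(20)` is 36: each member of the pair
contributes its 18 scores against the languages outside the pair.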

Further, I noted that, as I was reconstructing a tree, successive
groupings made the metric rely on fewer and fewer regression points.
Thus, with 20 languages, the first 2 languages grouped are chosen on the
evidence of 36 observations. Once grouped, you end up with 18 languages
and one mesolanguage. The next merger is therefore based on the evidence
of (19-2)*2 = 34 measurements. The further the reconstruction, the less
information it relies on. So I racked my brains, and produced a
_splitting_ algorithm instead which used the maximum amount of information
for the most remote reconstruction. It reconstructs the largest, most remote
groups first, then splits them into smaller and smaller subgroups until
only one language is left in each group. All the other methods did it
the other way: start with individual languages, and merge them into
larger and larger subgroups until you are left with just one big group
and you call that the protolanguage.

Many years later I tackled the problem from another angle: what are the
statistical properties of language families, and how can we elaborate a
metric, and a reconstruction method, congruent with those properties?
The result is the algorithm implemented in GLOTTREE of my GLOTTO
software package. It no longer uses linear-correlation coefficients: I
eventually discovered that they were mathematically inappropriate. It
still uses, however, the old n-way splitting algorithm of 15 years ago.
It still relies on looking at the amount of cognates shared by any two
languages with all the languages in the family, themselves excepted.

Phew! I have had enough. Now, back to what I am paid for.

Wait, there was another question. Yes, the lists in Hartigan's book
were from real languages, all Indo-European.
