6.436 Comparative Linguistics

Sun Mar 26 16:52:27 UTC 1995

----------------------------------------------------------------------
LINGUIST List:  Vol-6-436. Sun 26 Mar 1995. ISSN: 1068-4875. Lines: 378

Subject: 6.436 Comparative Linguistics

Moderators: Anthony Rodrigues Aristar: Texas A&M U. <aristar at tam2000.tamu.edu>
            Helen Dry: Eastern Michigan U. <hdry at emunix.emich.edu>

Asst. Editors: Ron Reck <rreck at emunix.emich.edu>
               Ann Dizdar <dizdar at tam2000.tamu.edu>
               Ljuba Veselinova <lveselin at emunix.emich.edu>
               Annemarie Valdez <avaldez at emunix.emich.edu>

-------------------------Directory-------------------------------------

1)
Date: Wed, 22 Mar 1995 11:35:36 +1100 (EST)
From: j.guy at trl.OZ.AU (Jacques Guy)
Subject: Why comparative linguistics

2)
Date: Wed, 22 Mar 1995 09:54:22 -0500
From: ECOLING at aol.com
Subject: Denying poss of compar technique

-------------------------Messages--------------------------------------
1)
Date: Wed, 22 Mar 1995 11:35:36 +1100 (EST)
From: j.guy at trl.OZ.AU (Jacques Guy)
Subject: Why comparative linguistics

It is no secret to readers of this list that I take a dim view of
comparative linguistics. Some may wonder then why I bother writing about
the subject. Contrary to what one would be justified to think, it is not
because I am intent on debunking the many fallacies that litter the
topic. I could hardly care less about X, Y or Z claiming to have
produced the ultimate classification of Q, W, E, R, T and Y and
consequently having located the cradle of Indo-European speakers,
Nostratophones, even Geophones (Proto-World speakers). I am interested
because the properties of language families are mirrored in other
aspects of language, far more important, and of immediate practical
interest. I have cobbled together this short explanation out of a
much longer article which I am still polishing. It is mostly about
the reconstruction algorithms in GLOTREE and GLOTLPP (part of the
lexicostatistical package in pc/linguistics/glotto02.zip at
garbo.uwasa.fi). I had sent a preview of it to Cameron Laird, and you
may perhaps (if it's still there) ftp it if you are interested: It's at
ftp.neosoft.com in pub/users/claird/sci.anthropology/text/jg_salish.zip
It's in PostScript format, 33 pages long. Here is what *really* makes
me tick:

Consider this excerpt from a table of cognate percentages
between eight languages:

     C    D    E    F    G    H
A   20   20   21   22   22   23
B   40   42   43   40   44   45

Language B scores about twice as many cognates with the rest as A does.
This pattern occurs if and only if languages C to H are *external* to A
and B, in other words, if A and B are *siblings*, or, in other terms again,
if A and B have an *immediate* common ancestor. When they do not, the
pattern is breached, thus:

     C    D    E    F    G    H
A   20   20   21   22   22   39
B   40   42   43   40   44   45

There, A and B would be siblings if H were not attested, or were removed
from the table (the proof is given in the original article; I feel too
lazy to translate the equations and diagrams from Word-for-Windows to
straight ASCII. It's not a trivial 5-minute task).
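
The proportionality pattern in the two tables above can be checked
mechanically. What follows is only a minimal sketch, not the algorithm
used in GLOTREE or GLOTLPP: the function name and the 15% tolerance are
assumptions made for the illustration. It tests whether the column-wise
ratios of B's scores to A's scores stay close to a single constant.

```python
# Cognate scores of A and B with languages C..H, copied from the tables above.
A_FIRST = [20, 20, 21, 22, 22, 23]
B_FIRST = [40, 42, 43, 40, 44, 45]
A_SECOND = [20, 20, 21, 22, 22, 39]   # second table: A scores 39 with H
B_SECOND = [40, 42, 43, 40, 44, 45]


def looks_like_siblings(row_a, row_b, tolerance=0.15):
    """Hypothetical check: every ratio b/a lies within `tolerance`
    (as a fraction of the mean ratio) of the mean ratio."""
    ratios = [b / a for a, b in zip(row_a, row_b)]
    mean = sum(ratios) / len(ratios)
    return all(abs(r - mean) <= tolerance * mean for r in ratios)


first_ok = looks_like_siblings(A_FIRST, B_FIRST)     # ratios all near 2
second_ok = looks_like_siblings(A_SECOND, B_SECOND)  # column H breaks the pattern
```

With the figures above, the first table passes the check and the second
fails it because of column H. A real procedure would need a principled
tolerance rather than a fixed one.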

This observation gives us a key to reconstructing the genetic trees of
language families whatever the variations in lexical retention rates,
and to estimating those retention rates as well. Thus in the fabricated
example above, B shares twice as many cognates as A with external
languages very simply because it has been twice as retentive since
the split of their common ancestor. Real data will exhibit similar
patterns, sometimes with surprising differences in retention rates:
thus Sakao is 1.5 times as retentive as Akei (both spoken in Espiritu
Santo, Vanuatu; see Guy 1982:288,300).

Note that it does not matter whether the figures in a cognation matrix
represent percentages or the actual numbers of cognates counted. A
language n times as retentive as another will have n times as many
cognates with external languages, whether that amount is expressed as a
percentage or as an absolute number of cognates.
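
As a hedged illustration of the last two paragraphs, the relative
retentiveness of two sibling languages can be estimated as the slope of
a least-squares line through the origin fitted to their scores with
external languages. The function name and the 200-item list size below
are assumptions for the example, not details of GLOTREE or GLOTLPP.

```python
def relative_retention(row_a, row_b):
    """Least-squares slope k in row_b ~ k * row_a (line through the origin)."""
    numerator = sum(a * b for a, b in zip(row_a, row_b))
    denominator = sum(a * a for a in row_a)
    return numerator / denominator


# Scores of A and B with C..H from the first table, read as percentages.
a_percent = [20, 20, 21, 22, 22, 23]
b_percent = [40, 42, 43, 40, 44, 45]
k_percent = relative_retention(a_percent, b_percent)

# The same scores as absolute counts on a hypothetical 200-item list.
a_counts = [2 * x for x in a_percent]
b_counts = [2 * x for x in b_percent]
k_counts = relative_retention(a_counts, b_counts)
```

Both calls give the same slope, a little under 2, since rescaling both
rows by the same factor leaves the ratio unchanged.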

Consider now a matrix of the frequencies with which the words from an
English corpus follow each other, and take, for instance, prepositions.
Prepositions will most often be followed by articles, adjectives or
substantives, very rarely by other parts of speech. If, say, "in" occurs
n times as often as "between", we may expect to observe "in the"
occurring approximately n times as often as "between the". Thus the
frequency vectors for words occurring in similar environments can be
expected to exhibit linear relations similar to those of the cognate
scores of sibling languages with external languages. This is the cause
of the results reported by Finch (1993). Finch submits matrices of
frequencies of co-occurrence of words or letters to various clustering
techniques and produces dendrograms showing for instance "dog, cat,
mouse" in one cluster, "girl, boy, woman, man" in another (Finch
1993:112), or, counting individual letters rather than words, showing
all vowels in one cluster, all consonants in another (ibid. p.118). He
leaves open to speculation the reasons for the success of these
procedures. The reason is: like tend to occur in like environments.
Vowels, for instance, tend to occur in the immediate environment of
consonants, and consonants of vowels; words of a given grammatical
category tend to occur in the immediate environment of words of those
grammatical categories which the syntax of the language allows. As the
environment is made wider and wider, syntactic contingencies become
weaker and weaker, gradually disappearing until only semantic
contingencies remain.
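
The idea that like tends to occur in like environments can be tried on
a toy scale. The sketch below is not Finch's procedure: it simply counts
which word immediately follows each word and compares the resulting
frequency vectors by cosine similarity, which ignores overall frequency
and so rewards the linear relation described above. The sample text and
the function names are invented for the illustration.

```python
from collections import Counter, defaultdict
from math import sqrt


def successor_counts(tokens):
    """Map each word to a Counter of the words that immediately follow it."""
    table = defaultdict(Counter)
    for word, following in zip(tokens, tokens[1:]):
        table[word][following] += 1
    return table


def cosine(u, v):
    """Cosine similarity of two sparse count vectors stored as Counters."""
    dot = sum(u[key] * v[key] for key in u if key in v)
    norm_u = sqrt(sum(x * x for x in u.values()))
    norm_v = sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Invented miniature corpus; a real test needs a large text.
text = ("in the house in the garden in a box between the trees "
        "between a rock the dog sleeps the cat sleeps a dog runs")
table = successor_counts(text.split())

prep_vs_prep = cosine(table["in"], table["between"])   # similar environments
prep_vs_article = cosine(table["in"], table["the"])    # different environments
```

On this toy text the two prepositions have nearly parallel vectors,
while "in" and "the" share no following word at all.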

Cognation matrices and frequency matrices of word or letter
co-occurrences share another, surprising, property. Consider this
matrix:

     D    E    F    G    H    I
A   19   19   20    2    3    4
B    0    1    1   20   19   19
C   20   20   21   22   22   23

Imagine that it contains cognate scores: A has nineteen words in common
with D and E, twenty with F, etc., and is clearly a member of the DEF
family, whereas B belongs to GHI. Note now how C is the sum of A and B
(to within one cognate, in column D).
Thus C might represent a sample of English and A and B its Romance and
Germanic components.

Imagine now that this matrix contains the frequencies of occurrence of
words A, B, and C in environments D...I. Row C might represent the
frequency distribution of the word "like", being the sum of the
frequencies of "like" (preposition) in row A and "like" (verb) in row B.
The quantitative properties of polysemic tokens are thus the same as
those of hybrid languages, and polysemy is mathematically equivalent to
hybridization or undetected borrowing. Therefore, a procedure capable of
identifying languages affected by hybridization or undetected borrowing
ought to identify polysemic words when applied to frequency matrices of
word co-occurrences. Once again, complex, seemingly almost intractable
problems of automatic text analysis seem to be reducible to much simpler
models.
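
The additive relation between a hybrid row and its components can be
checked numerically. A minimal sketch, assuming that ordinary least
squares is an acceptable way to recover the mixing weights: row C is
fitted as x*A + y*B by solving the 2x2 normal equations. This is an
illustration only, not the procedure of GLOTREE or GLOTLPP, and the
function name is hypothetical.

```python
def mixing_weights(row_a, row_b, row_c):
    """Least-squares x, y such that row_c ~ x * row_a + y * row_b."""
    aa = sum(a * a for a in row_a)
    bb = sum(b * b for b in row_b)
    ab = sum(a * b for a, b in zip(row_a, row_b))
    ac = sum(a * c for a, c in zip(row_a, row_c))
    bc = sum(b * c for b, c in zip(row_b, row_c))
    det = aa * bb - ab * ab          # zero only if A and B are proportional
    x = (ac * bb - ab * bc) / det
    y = (aa * bc - ab * ac) / det
    return x, y


# Rows A, B and C of the matrix above (columns D..I).
A = [19, 19, 20, 2, 3, 4]
B = [0, 1, 1, 20, 19, 19]
C = [20, 20, 21, 22, 22, 23]
x, y = mixing_weights(A, B, C)
```

For this matrix both weights come out close to 1, as expected of a row
that is essentially A plus B. The same fit applies unchanged whether
the rows hold cognate counts or co-occurrence frequencies.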

The two works quoted are:

Finch, S.P. (1993), "Finding Structure in Language". Ph.D. thesis,
available in electronic form in /pub/statling/Papers/phdThesis.ps.Z at
ftp.cogsci.ed.ac, University of Edinburgh, Scotland.

Guy, J.B.M. (1982), "Bases for New Methods in Glottochronology" in "Papers
from the Third International Conference on Austronesian Linguistics", Vol.1
Halim, Carrington, Wurm (eds). Pacific Linguistics, Canberra.

--------------------------------------------------------------------------
2)
Date: Wed, 22 Mar 1995 09:54:22 -0500
From: ECOLING at aol.com
Subject: Denying poss of compar technique

I am trying to learn what I can from J. Guy's recent posting on comparative
techniques and mutation algorithms.

First, two points on which we seem to totally agree (I think so, reading
Guy):

1.  Any development of techniques which include empirical content on possible
changes must rely on attested examples in which the antecedents
(proto-Romance for example) are at least approximately known (via attested
Latin, a close relative).  Discoveries that changes more often go in one
direction than in the reverse must in the first instance be based on such
attested examples.  Since the sample is small, there is much room for error.

2.  Many claims which have been made of "natural" linguistic changes going in
only one direction are false.

Next, two points on which we seem not to agree (I think so, reading Guy):

3.  There are in fact linguistic changes which go in only one direction.

4.  Guy seems almost to be denying that comparative-historical linguistics is
possible in the absence of written attestations of the earlier stages.  This
to me is a denial of the very meaning of historical method, which, once
established on cases where something approximating an ancestor is attested,
should, if validly so established, be usable also on cases where no such
ancestor is attested.

3.  There are in fact linguistic changes which go in only one direction.

Guy objected to the following (his quotation and emphasis):

"This "step by step" is like a minimal series of mutations, with the
added information that it is our business to learn which changes
(mutation steps) are more natural, and OF COURSE MOST of these go
only in one direction". (My emphasis again).

When I originally mentioned this topic, I was using it to explore one of
several reasons why a linguistic account of mutations might not be the same
as a computer-algorithm account of mutations.  I had in mind mutation
"steps", minimal very small changes, with detailed context, rather than
global changes.  Perhaps this clarification will help J. Guy to address the
questions I had at least intended to ask.

I assumed claims of such one-directional changes would be formulated in
sufficiently precise terms, not as global changes from for example having a
case system to not having one or the reverse, or as one sound changing into
another without considering surrounding context.  Like Guy, I would mock
either of the latter as linguistically naive.  But more precisely formulated
claims, many of the sort that particular morphological categories often
derive from phrases containing originally free words, or that particular
strings of adjacent sounds often change by a minimal step (!!!) into other
strings of adjacent sounds (including detailed context) are not at all
absurd, and in fact quite often valid.

There are celebrated exceptional cases of morphological elements becoming
more free, and of sequences of historical changes giving rise to "crazy"
synchronic patterns, which if interpreted directly as reflecting a single
direct change are highly misleading about possible or normal changes.  But
these are recognized as exceptions, and quite possibly the conditions under
which they can occur may be, if not determinable, at least circumscribable in
part.

4.  Guy seems almost to be denying that comparative-historical linguistics is
possible in the absence of written attestations of the earlier stages.  This
to me is a denial of the very meaning of historical method, which, once
established on cases where something approximating an ancestor is attested,
should, if validly so established, be usable also on cases where no such
ancestor is attested.

Essentially all historical sciences base their general principles on what
there is the best evidence for, and then extrapolate to attempt to render an
account of phenomena for which there is less, little, or no direct
attestation of earlier stages or origins.  Also, all historical sciences
prefer to operate with surviving records of earlier stages, but do not
restrict their considerations to such.

When Guy says:
   "Biologists are helped by the fossil record, linguists by documentary
   evidence, dated or datable. But most of the world's languages lack this
   evidence. And beyond some 5000 years in the past, the evidence is, in
   all cases and for all practical purposes, zilch."

The last statement is true in one sense, but false in another, because the
synchronic evidence of descendants is also evidence, if we have learned
anything at all about preferred paths of change.

I feel the same way Guy said he did in his message: why should this have to be
repeated?

It is self-evident in any field attempting historical reconstructions,
whether linguistics, biology, or anything else.  Biologists do in fact
attempt reconstructions of possible histories based on synchronic
descriptions, without using only the fossil record, based on their kind of
morphology, DNA studies, etc. etc.  The biologists' claims that among plants,
fungi, and animals, fungi and animals share some common history (of
innovations?), would be merely one of many such cases, using the fossil
record as far as it goes, but going beyond it to a deeper level.

If comparative-historical ***techniques*** are not merely techniques for
cataloging attestations, but are in fact making empirical claims about how
languages have been known to change in the past,
then it is certainly legitimate to attempt to extrapolate the application of
such techniques to cases where there is no attested (near-relative of an)
ancestral language in attempting to make sense of the patterned data of sets
of descendant languages.  Removing the attestation of a (near-relative of an)
ancestral language is only removing one piece of the evidence.  I have used
"near relative of an ancestral language" in this paragraph and earlier to
emphasize that these questions are, like most in the real world, not ones
with absolute answers.  In a world of greys, it would be just as possible to
deny that the edifice of the Romance family tree has an attested ancestor as
to assert that it does.  Is any minor deviation of the reconstructed ancestor
from attested Latin to be taken to invalidate the techniques? Of course not.
Yet the absolute tone of Guy's comments might also be used by someone to
suggest it is.

Lastly, places where I am still unclear on what Guy is referring to:

5.  Guy says that Hartigan's method was accompanied by a word list in some
number of languages, supplied by Dyen.  These are presumably real languages,
because Guy then says he (not Hartigan?)
   "applied it on language
   families computer-generated under the strict condition of a constant
   universal rate of lexical change."
I do not know what effect the assumption of constant universal rate of
lexical change might make on the mutation algorithm, but it is of course
claimed by most participants in these discussions to be a contrary-to-fact
assumption, that is, it is claimed that lexical items with some semantics
change more rapidly than those with other semantics.  I can certainly imagine
that artificially generated data may have fewer quirks that a computer
program could use to detect historical splits.  Whether this is the case is
one of the things I was trying to ask Guy about (perhaps my question was not
clear), and perhaps he has this additional information.
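
To make concrete what a language family "computer-generated under the
strict condition of a constant universal rate of lexical change" might
look like, here is a minimal sketch. It is not a reconstruction of Guy's
experiments: the tree shape, the language names, the 200-item list, the
0.8 retention rate and the seed are all assumptions chosen for the
illustration.

```python
import itertools
import random


def evolve(wordlist, retention, rng, fresh):
    """Return a daughter list: each item is kept with probability
    `retention`, otherwise replaced by a brand-new (non-cognate) item."""
    return [w if rng.random() < retention else next(fresh) for w in wordlist]


def shared(list_x, list_y):
    """Number of meaning slots in which two lists still hold the same item."""
    return sum(x == y for x, y in zip(list_x, list_y))


rng = random.Random(1995)        # fixed seed so the run is repeatable
fresh = itertools.count(1000)    # supply of new item identifiers
proto = list(range(200))         # hypothetical 200-item proto-language list
RATE = 0.8                       # same retention on every branch and step

north = evolve(proto, RATE, rng, fresh)   # first binary split
south = evolve(proto, RATE, rng, fresh)
n1 = evolve(north, RATE, rng, fresh)      # later split within the north branch
n2 = evolve(north, RATE, rng, fresh)
s1 = evolve(south, RATE, rng, fresh)

within_branch = shared(n1, n2)   # expected about 200 * 0.8**2
across_branch = shared(n1, s1)   # expected about 200 * 0.8**4
```

Under these assumptions every meaning slot changes at the same rate, so
the only signal left for any algorithm is the pattern of shared items,
which is higher within a branch than across branches.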

6.  I will have to actually read the article Guy refers to, I guess, namely
             (Experimental glottochronology: basic methods and results.
              Pacific Linguistics, Canberra, 1980. p.19)
because the account Guy extracted from it, namely
   "The program was fed the wordlists of the simulated language family,
   and a phylogenetic tree ([26]) drawn from the account of the
   successive mergings of lists and of the predicted past individual
   word replacements."
sounds if taken literally as if the results that the algorithm was supposed
to achieve were fed to it in advance.  Obviously, my literal interpretation
is not a viable one, so perhaps clarification is possible?

7.  Proceeding to what the algorithm did:
    "The reduced mutation algorithm identified the basic binary split in
   all experiments, but did not succeed, even once, in reconstructing
   the subsequent ternary split of ECHO-SIERRA, either as such, or as
   two successive binary splits."
Does this mean that the reduced-mutation algorithm has built into it a
preference for binary splits only, never ternary splits?

The majority of comparative-historical linguists who raise the issue of
binary vs. ternary splits, in my experience, take a position that binary
splits are preferred, or should be attempted before multinary, or else that
only binary splits are permitted in proper historical reconstruction (weaker
or stronger versions).  So this is hardly a criticism against a computer
algorithm.  Reference to "the ... ternary split of ECHO-SIERRA" seems to
imply that a ternary split was deliberately built into the data sets?  Perhaps
again my lack of understanding of the tree "fed" to the program is at issue.

8.  And the following I cannot understand at all, since it seems to
contradict what was said earlier:
    "The reasons for the resounding failure of the reduced mutation
   algorithm are somewhat akin to those for the failure of the
   traditional lexicostatistical method: the measure of the similarity or
   of the distance between two languages is based on data from just two
   wordlists."
I thought the data was multiple sets of word lists generated by random
mutation?  Otherwise, there is no work for the mutation algorithm to do, if
there are only two lists.  No matter what the data, it would in that case
simply posit one ancestor with the two attested descendants.

   "The measure of distance used by the reduced mutation
   algorithm is furthermore not reconcilable, at least in my eyes, with
   the linguistic model. Interested readers should refer to Hartigan
   1975:233-246."   (Ditto, p.33)
   The book in question is: Hartigan, John A.  Clustering Algorithms.
   Wiley, New York, 1975.

My attempts in the previous message were precisely to ferret out ways in
which the algorithm might be making assumptions contrary to what we linguists
know about historical change.  J. Guy's response does not (unless I missed
them) give any catalog of these, though he does discuss at length the
question of "naturalness" of changes.  On some of that, I agree with him,
though not on all, as explained above.

In summary,
I agree with some of what Guy says,
believe that some of it needs qualification,
and would still like more clarification on the assumptions behind the
mutation algorithm he was talking about, since his recent communication does
not unambiguously provide this clarification.  Indeed, my attempts at giving
a completely literal interpretation do not seem to work with much of what
Guy wrote on this, and I have not found a more abstract or metaphorical
interpretation to make sense of it either.

I would like clarification on points 5, 6, 7, and 8 above.

I can of course go to the references Guy mentions, but since he has already
read these, I am sure a number of us would very much appreciate it if Guy can
further clarify some of the matters concerning the algorithm.  I have done my
best to understand what he has already provided, and to clarify some of the
questions I posed in case they were not easily interpretable.

What is missing is not the things which Guy feels he has repeated many times,
which I suspect most of us have in fact understood and agree with in part and
not in part (as is anyone's right), but rather further clearly presented
information on the assumptions involved in the mutation algorithm.  Some of
these certainly may go beyond my attempts to guess at them (points 5 to 8
above and the guess that the mutation algorithm does not incorporate any
notions of preferences for some mutations over others).

Sincerely, Lloyd Anderson

--------------------------------------------------------------------------
LINGUIST List: Vol-6-436.