6.797, Comparative method: N-ary comparison

Fri Jun 9 06:26:53 UTC 1995

----------------------------------------------------------------------
LINGUIST List:  Vol-6-797. Fri 09 Jun 1995. ISSN: 1068-4875. Lines: 333

Subject: 6.797, Comparative method: N-ary comparison

Moderators: Anthony Rodrigues Aristar: Texas A&M U. <aristar at tam2000.tamu.edu>
            Helen Dry: Eastern Michigan U. <hdry at emunix.emich.edu>

Assoc. Editor: Ljuba Veselinova <lveselin at emunix.emich.edu>
Asst. Editors: Ron Reck <rreck at emunix.emich.edu>
               Ann Dizdar <dizdar at tam2000.tamu.edu>
               Annemarie Valdez <avaldez at emunix.emich.edu>

-------------------------Directory-------------------------------------

1)
Date: Mon, 5 Jun 1995 10:14:58 +1000 (EST)
From: j.guy at trl.OZ.AU (Jacques Guy)
Subject: Comparative method: N-ary comparison

2)
Date: Tue, 06 Jun 1995 15:50:32 +0930
From: "David M. W. Powers" (powers at ist.flinders.edu.au)
Subject: Re: 6.768, Comparative Method: N-ary comparison again

3)
Date: Tue, 6 Jun 1995 13:05:12 -0700 (PDT)
From: Scott DeLancey (delancey at darkwing.uoregon.edu)
Subject: Re: 6.768, Comparative Method: N-ary comparison again

4)
Date: Wed, 7 Jun 1995 18:04:53 -0400
From: Alexis Manaster Ramer (amr at CS.Wayne.EDU)
Subject: Re:  6.438 Fun: How to make linguistic theory, Pre-Proto-World Unveiled

-------------------------Messages--------------------------------------
1)
Date: Mon, 5 Jun 1995 10:14:58 +1000 (EST)
From: j.guy at trl.OZ.AU (Jacques Guy)
Subject: Comparative method: N-ary comparison

Alexis Manaster Ramer (amr at CS.Wayne.EDU) just wrote:

>Janhunen 1992 argues that the odds of finding apparent
>matches simply by chance when Japanese is compared to the four
>Altaic languages/subgroups, viz., Turkic, Mongolic, Tungusic, and
>Korean, are four times as high as are the odds of finding such
>spurious matches when Japanese is compared to just one language.

>In other words, Janhunen assumes that a 5-ary comparison is four
>times as likely to produce matches purely by chance (what I call
>'false positives') as is a binary comparison.  This, needless to
>say, is a fallacy, but there you have it.

I was curious to check that, so I turned to my simulation program
"chance". With 200 words, a one-in-250 chance of accidental
resemblances, semantic domains of size 8, I saw, after 600 simulations:

For 2 languages:   31.10 spurious resemblances per simulation
                         attested by 2 languages
For 5 languages:  141.14 spurious resemblances per simulation
                         attested by 2 languages
                   12.01 attested by 3 languages
                    0.22 attested by 4 languages
           Total: 153.37 spurious resemblances per simulation.

Yes, 153 chance resemblances out of a sample list of 200 words! That
is the cost of allowing semantic shifts.

Then, not allowing semantic shifts, I got, with the same parameters, the
following results:

For 2 languages:   0.535 spurious resemblances per simulation
                         attested by 2 languages
For 5 languages:   4.907 spurious resemblances per simulation
                         attested by 2 languages
                   0.022 attested by 3 languages
           Total:  4.927 spurious resemblances per simulation.

I was extremely surprised at both sets of results: a 5-ary comparison
is from 5 to about 10 times as likely to yield spurious resemblances as
a binary comparison!

My curiosity piqued, I tried again, with a vocabulary size of 100
words only.

Semantic domain size: 8
For 2 languages:   14.50 spurious resemblances per simulation
                         attested by 2 languages
For 5 languages:   83.88 spurious resemblances per simulation
                         attested by 2 languages
                    5.80 attested by 3 languages
                    0.13 attested by 4 languages
           Total:  89.81 spurious resemblances per simulation.

About six times as likely.

Semantic domain size: 1 (i.e. no semantic shifts allowed)
For 2 languages:   0.219 spurious resemblances per simulation
                         attested by 2 languages
For 5 languages:   2.509 spurious resemblances per simulation
                         attested by 2 languages
                   0.008 attested by 3 languages
           Total:  2.517 spurious resemblances per simulation.

About twelve times as likely.
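
The ratios quoted above can be recomputed directly from the reported
totals. The short Python sketch below does only that arithmetic; the
dictionary layout and the function name are invented here for
illustration, and the figures are copied from the tables in this message.

```python
# Totals of spurious resemblances per simulation, copied from the tables above.
# Keys: (vocabulary size, semantic domain size).
# Values: (total for 2 languages, total for 5 languages).
totals = {
    (200, 8): (31.10, 153.37),
    (200, 1): (0.535, 4.927),
    (100, 8): (14.50, 89.81),
    (100, 1): (0.219, 2.517),
}

def ratio(two_langs, five_langs):
    """How many times more spurious resemblances the 5-ary run produced."""
    return five_langs / two_langs

ratios = {key: ratio(*pair) for key, pair in totals.items()}

for (words, domain), r in ratios.items():
    print(f"{words} words, domain size {domain}: {r:.1f}x")
```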

Note that 600 simulations are not enough to get anything near
two-decimal-place accuracy. I just did not have the time
to wait for a more reasonable 10,000 iterations to run.

For details on the simulation method itself, see my article
in Anthropos Vol.90:223-228 "The Incidence of Chance Resemblances
on Language Comparison".
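
A minimal Monte Carlo sketch of this kind of experiment follows. It is
not the "chance" program itself, whose method is described in the
article cited above. It assumes, purely for illustration, that the form
of each word is an independent uniform draw from 250 equally likely
form classes (so any two words resemble each other with probability
1/250) and that no semantic shift is allowed. The function name and its
parameters are hypothetical. It counts matches in two ways: against a
single target language (language 0), and between any two languages.

```python
import random

def spurious_matches(n_langs, n_words=200, n_forms=250, trials=600, seed=1995):
    """Return (anchored, any_pair): the mean number of meanings per trial on
    which language 0 matches some other language, and on which any two
    languages match."""
    rng = random.Random(seed)
    anchored = any_pair = 0
    for _ in range(trials):
        for _ in range(n_words):
            # One randomly drawn form class per language for this meaning.
            forms = [rng.randrange(n_forms) for _ in range(n_langs)]
            if forms[0] in forms[1:]:
                anchored += 1
            if len(set(forms)) < n_langs:
                any_pair += 1
    return anchored / trials, any_pair / trials

binary = spurious_matches(2)
five = spurious_matches(5)
print("anchored ratio:", five[0] / binary[0])
print("any-pair ratio:", five[1] / binary[1])
```

Under these assumptions the anchored count grows roughly in proportion
to the number of languages compared with the target, while the any-pair
count grows roughly with the number of language pairs (10 for five
languages), so the way matches are counted matters a good deal.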

j.guy at trl.oz.au

--------------------------------------------------------------------------
2)
Date: Tue, 06 Jun 1995 15:50:32 +0930
From: "David M. W. Powers" (powers at ist.flinders.edu.au)
Subject: Re: 6.768, Comparative Method: N-ary comparison again

Alexis Manaster Ramer (amr at CS.Wayne.EDU) writes:

> In other words, Janhunen assumes that a 5-ary comparison is four
> times as likely to produce matches purely by chance (what I call
> 'false positives') as is a binary comparison.  This, needless to
> say, is a fallacy, but there you have it.

Under what assumptions is it a fallacy?  I don't know anything about Altaic or
comparative linguistics, but under certain assumptions it seems quite
reasonable.  In fact, it doesn't matter how ridiculous the assumptions are when
countering a blanket statement like this one, which doesn't state any.  The
assumptions I make are blatant oversimplifications, but I don't believe they
change the story much compared with reality.

It will depend on what you mean by a match.  I assume a match in both form and
meaning across a PAIR of languages, viz. a binary match.  Clearly a match across
all 5 languages is much stronger (less likely, not more likely, to arise by
chance).

Suppose W is the set of possible words for a given alphabet and length
criterion, and Vi is the vocabulary of language Li, randomly and equiprobably
selected (with replacement, so homonymy is possible) from W, and that
the sum over Li of |Vi| << |W| (viz. the number of possible words is
considerably greater than the number that occur in any or all of the group of
languages being compared).

We are interested in the probability of there being a word in common between
language L0 and some Li, i in [1,N] for N=1 and N=5.

We make the simplifying assumption that we require a match of both form
and meaning - we would get a similar result if we allowed some factor of
meaning shift in terms of some lattice of meaning relationships.

Let the set of concepts in each language Li be Ci. We further assume
that Ci = C (a universal set of concepts).  Let's suppose that
for all Li |Vi| = |V| = |Ci| = |C| (a language independent constant).
Let's further define a semantic function Mi which maps a word x (in Vi) to a
concept c in C for language Li, and assume that any word x is randomly and
equiprobably mapped to some c in C.

The assumptions we have correspond to a null hypothesis with all languages
INDEPENDENT.  In particular, languages 1 to N would be expected to have a total
vocabulary V[1,N] of size close to N * |V| after exclusion of their matches.

Then we have

p(x in Vi) = |Vi|/|W| = |V|/|W|
p(x in Vj for some j in [1,N]) = |V[1,N]|/|W|
                               = N * |V|/|W|

Then the probability of some specified concept c and word x of L0 matching in
Lj is

p(Mj(x)=c | given j, x and c such that M0(x)=c) = 1/|W|

whence the probability that there exists a c and word x of L0 matching in Lj is

p(Mj(x)=c | given j) ~ |V|/|W|                  (we assumed |V| << |W|)

and extending to the case of some Lj for j in [1,N] we have

p(Mj(x)=c | j in [1,N]) ~ |V[1,N]|/|W|          (we assumed |V| << |W|)
                        ~ N * |V|/|W|           (we assumed |V[1,N]| ~ N * |V|)

Thus the ratio of the number of FALSE matches for a group of N languages to
that for a single language is R = |V[1,N]|/|V| = N.
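
The approximation R = N can be checked numerically. The sketch below
assumes, in line with the independence assumptions above, that each of
the |V| words in each of the N languages is a separate chance, of
probability 1/|W|, to match a word of L0 in both form and meaning. The
function name and the sizes chosen for |V| and |W| are hypothetical.

```python
def p_false_match(n_langs, vocab_size, n_possible_words):
    """Probability that at least one word of L0 is matched, in form and
    meaning, by at least one of n_langs independent languages."""
    p_single = 1.0 / n_possible_words   # one word, one language
    chances = n_langs * vocab_size      # independent opportunities
    return 1.0 - (1.0 - p_single) ** chances

V, W = 1_000, 10_000_000   # hypothetical sizes with |V| << |W|
for n in (1, 2, 5):
    r = p_false_match(n, V, W) / p_false_match(1, V, W)
    print(n, round(r, 3))
```

With |V| << |W| the ratio stays very close to N; it falls below N once
|V| is no longer small relative to |W|, because the probability of at
least one match cannot exceed 1.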

The main assumption I have made that DOESN'T hold for N>1 under the COMPARATIVE
hypothesis when the N languages form a language group is that the N languages
are mutually INDEPENDENT.  In fact, we are assuming that they will have a
SIGNIFICANT number of TRUE matches, quite apart from the FALSE matches which we
are exploring in relation to the NULL hypothesis.  It all depends on what you
mean by SIGNIFICANT!

What this MAY imply is that |V[1,N]| << N * |V|
and the ratio of FALSE matches R = |V[1,N]|/|V| << N.

In other words, the ratio you actually get is equal to the ratio of the total
vocabulary of the N language group to that of an individual language.

Clearly if you use N identical languages, that ratio is 1, and you are no
better off.  However, if there is any point in using multiple languages it must
be that you expand the number of potential matches - and the above formula for
R applies.

In other words, if your use of N languages is going to increase the potential
for TRUE matches by N then it will also increase the potential for FALSE
matches by N.  Or again, if you choose N languages which are representative of
different features of the language group, you will tend to multiply the number
of FALSE matches by N; but if you choose N languages which are representative
of the core features of the language group and have little extraneous
vocabulary, then you gain nothing - either in terms of the number of TRUE
matches or the number of FALSE matches.

In yet other words, increasing the number of languages compared doesn't
guarantee improving your signal to noise ratio.  It may however do so if they
represent N different subfamilies and the target language has roots in more
than one of them.  The best signal to noise ratio would, in fact, seem to occur
when using the minimum set of related languages - a chicken and egg problem we
can bypass by accumulating evidence only from languages that in binary
comparison pass some significance test, or where the cumulative N-ary results
prove more significant than the individual binary results (which may happen if
the TRUE matches are relatively INDEPENDENT, in which case R -> N).

It strikes me that R can be determined quite easily for any analysis which has
been brought into question, using the above equivalence:  R = |V[1,N]|/|V|.
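
As a rough sketch of how that might be done, the Python below computes
R from vocabularies stored as sets. Two choices here are assumptions,
not part of the argument above: each vocabulary item is a (form,
meaning) pair, and |V| is taken as the mean vocabulary size, since real
vocabularies will not all be the same size. The function name and the
toy data are hypothetical.

```python
def false_match_ratio(vocabularies):
    """R = |V[1,N]| / |V|: size of the pooled vocabulary of the N languages
    divided by the mean size of a single vocabulary."""
    pooled = set().union(*vocabularies)
    mean_size = sum(len(v) for v in vocabularies) / len(vocabularies)
    return len(pooled) / mean_size

# Hypothetical toy data: each item is a (form, meaning) pair.
identical = [{("ma", "mother"), ("ta", "father")}] * 3
disjoint = [
    {("ma", "mother"), ("ta", "father")},
    {("ne", "mother"), ("ko", "father")},
    {("su", "mother"), ("pi", "father")},
]
print(false_match_ratio(identical))  # 1.0
print(false_match_ratio(disjoint))   # 3.0
```

The two toy cases correspond to the extremes discussed above: N
identical languages give R = 1, and N languages with no shared
vocabulary give R = N.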

Yours thoughtfully but no doubt ignorantly,
David

--      powers at acm.org   http://www.cs.flinders.edu.au/people/DMWPowers.html
Associate Professor David Powers                David.Powers at flinders.edu.au
        SIGART Editor; SIGNLL Chair             Facsimile:    +61-8-201-3626
Department of Computer Science                  UniOffice:    +61-8-201-3663
The Flinders University of South Australia      Secretary:    +61-8-201-2662
GPO Box 2100, Adelaide  South Australia 5001    HomePhone:    +61-8-357-4220

--------------------------------------------------------------------------
3)
Date: Tue, 6 Jun 1995 13:05:12 -0700 (PDT)
From: Scott DeLancey (delancey at darkwing.uoregon.edu)
Subject: Re: 6.768, Comparative Method: N-ary comparison again

Alexis Manaster Ramer (amr at CS.Wayne.EDU) presents as an example
of a published claim "that binary comparison is preferable to n-ary
comparison" the following:

> In his attack on the theory that Japanese is Altaic (and on Altaic
> as a whole), Janhunen 1992 argues that the odds of finding apparent
> matches simply by chance when Japanese is compared to the four
> Altaic languages/subgroups, viz., Turkic, Mongolic, Tungusic, and
> Korean, are four times as high as are the odds of finding such
> spurious matches when Japanese is compared to just one language,
> specifically Korean

There may be a confusion here between two notions of "comparison".
In the use of that term which is standard among many linguists, it
refers to reconstructing a protolanguage on the basis of data from
attested daughter languages.  This is a task which one would undertake
only after being convinced that the attested languages are in fact
genetically related.  And in this sense of "comparison", it is hard
to imagine any reasonable linguist arguing that binary comparison is
preferable to n-ary (unless, of course, there is reason to believe
that two specific languages form a genetic subgroup).  Clearly, the
more data one can bring to this task the better.
     But if I understand AMR's example correctly, what is under
discussion is the other sense of comparison, which refers to seining
data from two or more languages looking for resemblant forms, which
are then to be assessed for their value as evidence that the languages
are related.  This is a very different proposition from the first,
although they are certainly not unconnected.  (Comparison in this
sense, for example, may constitute the groundwork leading to a
hypothesis of relationship which then can be pursued by the method
of comparative reconstruction).  In this sense there is a problem
with n-ary comparison, and it is exactly the one which Janhunen
suggests:

> In other words, Janhunen assumes that a 5-ary comparison is four
> times as likely to produce matches purely by chance (what I call
> 'false positives') as is a binary comparison.  This, needless to
> say, is a fallacy, but there you have it.

Well, not needless to say, because in fact it doesn't look like a fallacy
to me.  If I search through the vocabulary of English and Klamath
looking for possible cognates, I will certainly find a few resemblant
forms that might be candidates.  If I extend my search to include
Yokuts, Maidu, and Wintu, and thus have (roughly) four times as
much vocabulary to search through for English resemblants, don't I
have (roughly) four times as much chance of finding some?

Scott DeLancey                  delancey at darkwing.uoregon.edu
Department of Linguistics
University of Oregon
Eugene, OR 97403, USA

--------------------------------------------------------------------------
4)
Date: Wed, 7 Jun 1995 18:04:53 -0400
From: Alexis Manaster Ramer (amr at CS.Wayne.EDU)
Subject: Re:  6.438 Fun: How to make linguistic theory, Pre-Proto-World Unveiled

How to Disprove that the Indo-European Languages are Related

1. They are too similar to be genuinely related.  While there
are a few cases where supposed IE cognates really look dissimilar
(e.g., Arm erku and Sindhi b'a '2'), there are MANY MORE forms
with the same meanings but looking even more dissimilar if we
compare each IE language with some other group, e.g., Polish
with Basque, Armenian with Aztec, or Sindhi with Bangu-Bangu.

2. The founder of IE comparative studies, Bopp, also thought
that IE included Kartvelian and was related to Austronesian.
This obviously undermines the validity of the IE connection
itself.

3. Given the rate at which languages supposedly lose
vocabulary and the supposed age of Proto-IE, we should
only be able to reconstruct at best a small fraction of
the so-called Swadesh 100-word list.  But actually
Indo-Europeanists turn out to reconstruct over 90%
of these items, so there is a fundamental contradiction
which means that the IE hypothesis must be wrong.

4. All that stuff was published in a foreign language
(German), so who can possibly evaluate it or even review it,
and why should we bother with it?  Only linguistic work
published originally in English should count!

--------------------------------------------------------------------------
LINGUIST List: Vol-6-797.