Random Noise - quite different questions?

Dr. John E. McLaughlin mclasutt at brigham.net
Sat Sep 4 00:39:05 UTC 1999


> -----Original Message-----
> From: Indo-European mailing list [mailto:Indo-European at xkl.com] On Behalf
> Of ECOLING at aol.com
> Sent: Wednesday, September 01, 1999 9:37 PM
> To: Indo-European at xkl.com
> Subject: Random Noise - quite different questions?

> I confess that I do not entirely understand the reasoning used
> by John McLaughlin in his message on this subject today.
> That is not an oblique criticism, it simply means only what
> it says.  I would appreciate if the logic and assumptions were
> laid out in greater detail.  I promise not to be offended if some
> of it seems exceedingly elementary.

Here's the way it works in a nutshell:

Assume a language that we'll call A.
This language has an inventory of between 10 and 20 consonants (on the low
side, but still quite common).
The phoneme /t/ occurs in initial position in 20% of the words of the
language (a little on the high side, but quite common for a language with a
small consonant inventory).
Based on some list of meanings, build a list of 1000 words in A.
200 of these words will begin with /t/.

Now assume an unrelated language B.
This language also has a consonant inventory of between 10 and 20 and /t/
occurs in initial position 20% of the time.
Now take those 1000 words of A and find words with the same meaning in B.
200 of the words in B will begin with /t/.
Of the 1000 pairs of words, 20% will have the word in A start with /t/ and
20% will have the word in B start with /t/, so 20% of 20% (1000 times .2
times .2) means that 40 pairs of words (with the same meaning) will start
with /t/ in both A and B.
Total "cognate sets" for two languages--40.

Now assume a third language, C, unrelated to either A or B.
This language also has the same size consonant inventory and frequency for
initial /t/ as we have for A and B.
Now find the equivalent word in C for the 1000 meanings we've been using.
Between A and B there will be 40 words that randomly match initial /t/.
Between A and C there will be 40 words that randomly match initial /t/.
Between B and C there will be 40 words that randomly match initial /t/.
There will also be 8 words in which all three languages have an initial /t/
for the same meaning (20% of 20% of 20% of 1000).
Total "cognate sets" for three languages--120 pairs, 8 triplets.

Now add that fourth unrelated language D.
This language has the same initial conditions as the other languages.
We'll add the 1000 words of D to our multilateral comparison.
Now there'll be 40 pairs of words where two languages have initial /t/ for
each of the following pairs:  A-B, A-C, A-D, B-C, B-D, C-D
There'll be 8 triplets of words where three languages have initial /t/ for
each of the following triplets:  A-B-C, A-B-D, A-C-D, B-C-D
There'll also be 1 or 2 quadruplets where all four languages have initial
/t/ (20% of 20% of 20% of 20% of 1000 = 1.6).
Total "cognate sets" for four languages--240 pairs, 24 triplets, 1 or 2
quadruplets.
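
For any number of languages, the bookkeeping is just a count of the ways of
choosing which languages share the consonant.  A sketch with the same assumed
figures (the helper name is just for this sketch; comb(n, k) is the binomial
coefficient):

from math import comb

def expected_matches(n_langs, list_size=1000, p=0.20):
    # Expected chance matches on one initial phoneme, keyed by how many
    # languages share it (2 = pairs, 3 = triplets, ...).
    return {k: round(comb(n_langs, k) * list_size * p**k, 2)
            for k in range(2, n_langs + 1)}

print(expected_matches(4))   # {2: 240.0, 3: 32.0, 4: 1.6}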

Just for overkill in the demonstration, we'll add a fifth unrelated language
E.
Same initial conditions.
We now have the following pairs (40 matches each for 400 pairs):
A-B, A-C, A-D, A-E, B-C, B-D, B-E, C-D, C-E, D-E
The following triplets (8 matches each for 80 triplets):
A-B-C, A-B-D, A-B-E, A-C-D, A-C-E, A-D-E, B-C-D, B-C-E, B-D-E, C-D-E
The following quadruplets (1.6 matches each for 8 quadruplets):
A-B-C-D, A-B-C-E, A-B-D-E, A-C-D-E, B-C-D-E
Total "cognate sets" for five languages--400 pairs, 72 triplets, 8
quadruplets
Looking at just one of the five languages, the form in A matches the form in
one other language in 160 words, in two other languages in 48 words, and in
three other languages in about 6 words.
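
The view from a single language can be sketched the same way (same assumed
figures; the counts are inclusive, so a word that shows up in a triplet is
also counted in the pairs that contain it):

from math import comb

def matches_involving(n_langs, list_size=1000, p=0.20):
    # Expected chance matches that include one fixed language (say A)
    # together with k of the other languages.
    return {k: round(comb(n_langs - 1, k) * list_size * p**(k + 1), 2)
            for k in range(1, n_langs)}

print(matches_involving(5))   # {1: 160.0, 2: 48.0, 3: 6.4, 4: 0.32}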

As we keep adding languages, the number of possible pairs, triplets, and
quadruplets begins to multiply, so that once six languages are in play there
will be one or two quintuplets in the data and 24 quadruplets.  You can
easily see how a vast web of "cognation" can be built from random
correspondences.
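
The same point can be checked empirically.  Here is a rough simulation sketch
for six unrelated "languages" (the 20% skew toward one initial consonant and
the 1000-item list are still the assumed figures; exact counts will vary with
the random seed):

from math import comb
import random

random.seed(0)
n_langs, list_size, p_t = 6, 1000, 0.20

counts = {k: 0 for k in range(2, n_langs + 1)}
for _ in range(list_size):
    # how many of the six languages happen to start this word with /t/
    n_t = sum(random.random() < p_t for _ in range(n_langs))
    for k in counts:
        # every k-language subset sharing /t/ is one random "cognate set"
        counts[k] += comb(n_t, k)

print(counts)   # simulated pairs, triplets, quadruplets, ... for this run
print({k: round(comb(n_langs, k) * list_size * p_t**k, 2) for k in counts})
# the expected values: {2: 600.0, 3: 160.0, 4: 24.0, 5: 1.92, 6: 0.06}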

Possible variations (a sketch of the arithmetic follows the list):

1.  Increase semantic latitude:  Multiply the probability of getting a random
match by however many related meanings you accept in addition to the first
one (increases the probability)

2.  Check for two consonants rather than just one:  Multiply the probability
of matching one consonant by the probability of matching the other one
(decreases the probability)

3.  Check for the consonant in positions other than initial position:
Multiply the probability by however many other places in the word are checked
(increases the probability)

4.  Increase the number of phonemes that can match the first:  Add the
probabilities of each acceptable consonant together before computing
(increases the probability)
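
A rough sketch of how these variations rescale the arithmetic (every specific
number here is an illustrative assumption, not a measurement from any real
comparison):

list_size = 1000
p_t = 0.20               # frequency of /t/, or of the whole class of
                         # phonemes accepted as a match (variation 4)

p_pair = p_t * p_t       # baseline: exact meaning, one consonant,
                         # initial position only

p_pair *= 3              # variation 1: accept, say, 3 related meanings
p_pair *= 2              # variation 3: accept the consonant in either of
                         # 2 positions in the word
p_pair *= 0.20 * 0.20    # variation 2: also require a second consonant
                         # (itself at an assumed 20%) to match

print(round(list_size * p_pair, 1))   # 9.6 under these particular assumptions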

What about different sizes of consonant inventory?  This obviously
complicates the issue exponentially, but the basic principles demonstrated
above all still apply (the math just gets messier).  If one were to compare
!Xu with Rotokas, for an extreme example, then there are two possibilities.
First, if one compares !Xu /t/ with Rotokas /t/, then the number of random
"cognates" will be very small since !Xu /t/ represents only a very small
share of the consonant inventory while Rotokas /t/ is one-sixth of the
inventory (but, as with most languages, it probably represents about 20-30%
of the initial consonants in the vocabulary of Rotokas).  Second, if we
compare Rotokas /t/ (the only voiceless coronal obstruent) with !Xu /t/,
/th/ (aspirated), /t'/, /ts/, /tsh/, /s/, /tk/ (velarized), /tsk/, /tS/
(postalveolar), /tSh/, /tS'/, /S/, and /tSk/ (the voiceless coronal
obstruents in !Xu), the number of random matches increases greatly.  In
effect, the first method compares one-fourth of the Rotokas vocabulary (/t/
at 25% initial frequency) with perhaps one-twentieth (/t/ at 5% initial
frequency) of the !Xu vocabulary, while the second method compares
one-fourth of the Rotokas vocabulary (/t/) with one-fourth of the !Xu
vocabulary (voiceless coronal obstruents of any stripe at a combined total
of 25% initial frequency).
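
In numbers, using the illustrative frequencies from the paragraph above, the
two ways of pairing the inventories give very different expectations for
random matches:

list_size = 1000

rotokas_t = 0.25      # Rotokas /t/, assumed ~25% of initial consonants
xu_t_only = 0.05      # !Xu /t/ by itself, assumed ~5%
xu_coronals = 0.25    # all !Xu voiceless coronal obstruents, assumed ~25%

print(list_size * rotokas_t * xu_t_only)     # method 1 (/t/ with /t/): 12.5
print(list_size * rotokas_t * xu_coronals)   # method 2 (/t/ with class): 62.5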

Alexis Manaster-Ramer and I saw eye to eye on this issue, and we had
discussed the possibility of co-authoring a paper on it.

The Greenbergian multilateral method (as illustrated in the "cognate sets"
found in Language in the Americas) makes use of options 1, 3, and 4 above,
but not 2 (if a "cognate" can't be found for both "reconstructed" consonants
in a language, then just one will do).  Greenberg also matched different sizes of
inventories in the second method described above.  Indeed, there is some
evidence that the total consonant inventory of each language didn't even
come into consideration when comparing forms, so that, for example, /t/ in a
Salishan language could be compared to /t'/ in a Siouan language even though
the Salishan language has /t'/ in its inventory.

Greenberg's "Amerind" classification never rises above the level of random
noise.

John E. McLaughlin, Ph.D.
Assistant Professor
mclasutt at brigham.net

Program Director
Utah State University On-Line Linguistics
http://english.usu.edu/lingnet

English Department
3200 Old Main Hill
Utah State University
Logan, UT  84322-3200

(435) 797-2738 (voice)
(435) 797-3797 (fax)


