<Language> Critique of an Internet publication on Historical Linguistics

H. Mark Hubey HubeyH at mail.montclair.edu
Sun Aug 8 21:00:34 UTC 1999


The attached file is a review/critique of a quantitative model of historical
and comparative linguistics.

--
Sincerely,
M. Hubey
hubeyh at mail.montclair.edu
http://www.csam.montclair.edu/~hubey
-------------- next part --------------
My comments on the article by Mark Rosenfelder (MR) on the probability of
occurrences of false (pseudo) cognates (words in different languages that look
like they are descended from the same root, but are alike purely through
randomness and chance) follow. Because it is long, I don't have time to keep
hashing this over and over again.
M. Hubey.

> -----------------------------begin quote---------------------------

> On sci.lang we are often presented with lists of resemblances between
> far-flung languages (e.g. Basque and Ainu, Welsh and Mandan, Hebrew and
> Quechua, Hebrew and every other language, Basque and every other
> language), along with the claim that such resemblances "couldn't be due to chance",
> or are "too many" to be due to chance.

Good start.

> Linguists dismiss these lists, for several reasons. Often a good deal of
> work has gone into them, but little linguistic knowledge. Borrowings and

That may be true. However, that is not the main reason linguists dismiss
them. Most of them dismiss them because some "great authority" in
linguistics has dismissed them. Even if this is wrong, they are happy
and certainly feel no angst about it. The problems start when several
authorities start to disagree. That is when linguists feel their
angst.

> native compounding are not taken into effect; the semantic equivalences

Borrowings can only be taken into account when one is reasonably sure
of what the language "really looked like". That is circular.
If the whole method by which linguists decide which languages are related
to which, and how, is itself in question, it is useless to use circular
reasoning. The standard way of reasoning among most linguists is:

1. We "KNOW" X and Y are not genetically related.
2. Probability theory says X and Y are genetically related.
3. Therefore if someone uses probability theory to reach the conclusion
   that X and Y are genetically related, then probability theory
   must be no good.

This is not only circular but even obnoxious, because genetic relatedness
itself is full of holes. For those who are mathematically inclined, which
should be many people on sci.math and sci.math.stat :-), here are more
things to read before you decide.

On the Comparative Method:
    http://www.csam.montclair.edu/~hubey/ZIP/comp-pdf.zip
On Probability of Accidental Cognates:
    http://www.csam.montclair.edu/~hubey/ZIP/acc.cognate.ps
On Mathematics of Historical Linguistics: use, misuse, and abuse
    http://www.csam.montclair.edu/~hubey/ZIP/hist-pdf.zip

The Comparative Method is "THE" method used by historical linguists
to make some of their outlandish and outrageous claims. For example, the
language of the Rom (the Gypsies) belongs to the Indic family despite
possessing only 10% Indic vocabulary. This is like calling
someone "black" as long as there is any physical characteristic that
even reminds anyone of being black. The "geneticity" of linguistics should
be taken with a large vat of salt.

> proffered are quirky; and there is no attempt to find systematic sound

Define "quirky". If you look at what historical linguists try to pass off
as science, you cannot have many doubts that it is historical linguists
who are quite quirky. Their concept of "genetic" is that of bacteria
continuously splitting. There is no concept of a language having two
parents. Even worse, they seem to be doing something akin to tracing
a particular section of the mtDNA to account for geneticity.

> correspondences. And linguists know that chance correspondences do
> happen.

It has already been explained a zillion times that "systematic
sound correspondences" are:

1. as poorly defined as the word "regular" that is batted around
   constantly by linguists;
2. as bad as "system", much talked about but rarely used or understood
   (see the book by Lass, a great linguist by any standard);
3. simply "heuristics" that try to stand in for probabilistic
   methods.

Heuristic means "rule of thumb". There is no need for a rule of
thumb if probability theory is used. If MR is using probability
theory, why is he already invoking a heuristic to justify
his probability theory calculations (if they can be called that)?

See above for this circularity fallacy. He has already committed it.

> All this is patiently explained, but it doesn't always convince those
> with no linguistic training-- especially the last point. Human beings have

All of this is again meant to browbeat ignorant people who are not
sure of what they are saying. Secondly, linguistic training is not
necessary to do probability theory. Thirdly, this sounds strangely
like "argument from authority" again, and again, and again. One can
imagine philosophers arguing with Newton about his laws of physics,
telling him that he is not qualified.

> been designed by evolution to be good pattern matchers, and to trust the
> patterns they find; as a corollary their intuition about probability is
> abysmal. Lotteries and Las Vegas wouldn't function if it weren't so.

Ridiculous reasoning. Probability theory is about patterns, just like
all math. Math is the science of data compression. The only way to
compress data is to find patterns in it. In that sense probability
theory and its use is really a statement about our inability to
do better.

Lotteries and Las Vegas have nothing to do with this part of intuition.
There are people who literally cannot read but do quite well playing
backgammon. The reason is quite obvious: after playing for 20-30
years they develop knowledge of expectations. Besides, all this is
useless blather. It does not belong in a purportedly scientific paper.

There are psychological reasons why people gamble. Besides, all of this
is not germane. If it is supposed to be evidence for a particular statement,
it would be easier to just make the statement and try to back it up
with statistical and published data. Indeed, this kind of reasoning is
typical of linguistics, in which a sweeping generalization is made
and then a feeble example is offered as proof.

> So, even one resemblance (one of my favorites was gaijin vs. goyim) may
> be taken as portentous. More reasonably, we may feel that one resemblance

This is more ridiculous reasoning. Again, instead of doing probability theory,
after talking about how poor people are at using intuition, he goes into
intuition mode and botches his example. Here is already evidence of an
underdeveloped comprehension of what probability theory is about. Think of
an archaeologist. Suppose he is not too intelligent and goes digging
everywhere, and examines all stones about the size of a cigarette pack or
smaller. Chances are he will dig a very long time before he finds one
which looks like an arrowhead. As soon as he finds one, he immediately
"knows" (decides) it is portentous. Is he wrong? Of course not. The
probability that the arrowhead is due to chance is small. Therefore he
is right to suppose that it is not due to chance. But a linguist who
finds a word that resembles another dismisses it. Is it because its
probability is high? No, it is small. The only difference is that people
who live in cities do not go looking at rocks, but they do deal with
thousands of words every day. The difference again is that of not
being able to comprehend probability theory and probabilistic reasoning.
A linguist who knows several languages or has read many books might
come across words from different languages that are alike, but it happens
very rarely. If that were not the case we would have to have thousands
of examples in books. But they don't exist.

> may be due to chance; but some compilers have amassed dozens of
> resemblances.

Here is yet another problem. What does "resemblance" mean? This is a
typical problem for linguists. Check Trask's glossary of
linguistics. This "GOD" of Linguistics still has not understood the
relationship between "phonetic resemblance" and "phonetic distance".

Like all other linguists, and like MR, he does not know what he means
when he uses the term "resemblance". This word is a code word for
linguists; it is a euphemism that says "this is not a cognate". Yet
despite this definition tagged onto the word, one can visit the
Linguistlist.org website, check the Histling list archives and
find more circular reasoning by the Great God of Linguistics, who argues
basically like this:

1. These two words x and y are resemblances.
2. Hence they are not cognates.
3. Hence the languages in question are not related.

The euphemism "resemblance" means exactly "words that look alike
from unrelated languages but which are not cognates". Talk about circular
reasoning!

Of course, the real generic meaning of "resemblance" is that it is
the inverse of "distance", because resemblance is similarity. That
can also be seen in my papers above. However, one still needs to
read a lot into Trask's glossary to be even generous in imputing
comprehension of the concept to him. Anyone who does not think
so can get the book and read it for himself/herself.

Finally we get to the most basic and obnoxious comment: some "compilers
have amassed dozens of resemblances". Let's recall a remark by
Poincaré: "A house may be a pile of stones, but a pile of stones is
not a house. Similarly, science may be a collection of facts, but
a collection of facts is not science." Now MR is implying that
a compilation of resemblances is not anything else but... what?

What are cognates if not collections of resemblances? Now here is
the trick often pulled by linguists who have not yet understood
what these words mean. They claim that cognates sometimes do not
resemble each other. The answer is, so what? There are still plenty
of cognates that do resemble each other. The easiest way to see
this clearly is to take any language and compare it to itself or to
a dialect. Obviously a language is genetically related to itself.

Besides, one of the purposes of finding recurrent sound changes is
allegedly that they "denote geneticity", but this is a false statement.

Words that are copied from other languages also undergo "recurrent/regular"
sound changes.
(See my papers on this subject for how one can create and use
dynamic stochastic process models to resolve this issue rather
easily. Obviously, over time both semantic and phonetic distances
between words increase. We all know that.)

On the other hand, the concept of distance neatly resolves all the
problems. All you have to do is reverse the sound changes which
are alleged to be regular, and you should eventually change all
the cognates in one language into exact copies of the words in the
other language. But then this can still be subsumed under the concept
of distance (which is simply the inverse of similarity), so it presents
absolutely no conceptual or practical problem. Especially in light of
the well-known fact that borrowed/copied words from other
languages also go through regular sound changes, it is impossible to
comprehend how linguists can continue to repeat nonsense forever.
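
Since distance keeps coming up here as the inverse of similarity, here is a
minimal sketch of the idea in Python, assuming for illustration only that
plain edit (Levenshtein) distance over phoneme strings stands in for a proper
feature-based phonetic distance:

    # Minimal sketch: treating "resemblance" as the inverse of a phonetic
    # distance.  Plain Levenshtein (edit) distance over phoneme strings is
    # an assumption made purely for illustration.

    def levenshtein(a, b):
        """Classic dynamic-programming edit distance between two sequences."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                 # deletion
                               cur[j - 1] + 1,              # insertion
                               prev[j - 1] + (ca != cb)))   # substitution
            prev = cur
        return prev[-1]

    def similarity(a, b):
        """Map distance into [0, 1]: 1 = identical, 0 = maximally different."""
        d = levenshtein(a, b)
        return 1.0 - d / max(len(a), len(b), 1)

    print(similarity("gaijin", "goyim"))   # about 0.33 for this pair

Any better phonetic distance can be dropped into the same slot; the point is
only that "resemblance" becomes a number instead of a euphemism.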

It is especially discouraging to see someone who claims to be doing
mathematics to solve this problem indulge in this useless
repetition of almost worthless concepts. The net result is that whatever
is valuable in the heuristic of comparative linguistics can easily
be subsumed into the framework of distance/similarity and probability
theory, and whatever cannot be done is probably not worth writing
about.

> Such lists may be criticized on other grounds, but even linguists may

Like what grounds?

> not know if the chance argument applies. Could a few dozen resemblances be
> due to chance? If not, what is the approximate cutoff?

OK. It's about time to get into the nitty-gritty.

> The same question comes up in evaluating the results of Greenbergian
> mass comparisons; or proposals relating language families (e.g. Japanese and
> Tibeto-Burman) based on very small numbers of cognates. Again, it would
> be useful to know how many chance resemblances to expect.

Good idea. It is an even better idea to first see what has been done
on the topic. Any researcher knows that it is not a good idea to
re-invent the wheel. Why don't you try reading the material on my
website? It has references to other works in this field.

> I will propose a simple but linguistically informed statistical model for
> estimating the probability of such resemblances, and show how to adjust
> it to match the specific proposal being evaluated.

It is common in writing scientific papers to give references to works
in related fields. Where are your references to relevant works?

> A trivial model: Abstract languages sharing a phonology

> Let's start with a simplified case (we'll complicate it later). We will
> compare two unrelated languages A and B, each of which has 1,000 lexemes
> of the form CVC, and an identical semantics and phonology. That is, if

A good start. This is a copy of my model circa 1993, which was published in
my book in January 1994. I might have even sent you a copy. Your name is on
the acknowledgements page of my book, along with all the others who were
involved in the discussions on this topic on sci.lang circa 1993. How about
truth in advertising? Does this mean anything to you? The name of the book is
Mathematical and Computational Linguistics, and it can be found in the MSU
library by anyone who wants to check, by obtaining it via interlibrary
loan.

The chapter in which my model can be seen is also on my homepage and has
been there for a while.

> there
> is a lexeme a in A with some meaning M, there will be a lexeme bp in B
> phonetically identical to a, and a lexeme bs with the same meaning as a.
> What is the probability that bp is bs?-- that is, that there is a chance
> resemblance with a? It can be read off from the phonology of the typical
> root. Supposing there are 14 consonants and 5 vowels, it is 1/14 * 1/5 *
> 1/14, or 1 in 980. (This assumes that the vowels and consonants are
> equiprobable, which of course they are not.) For ease of calculation

Where does it assume that vowels and consonants are equiprobable? Not only
have you assumed that all the words in this language will be of the form
CVC (consonant-vowel-consonant), but also that there are more consonants than
vowels.

> we'll round this to 1 in 1000.

So you have now abstracted whole complete languages into 1,000 CVC syllables.
All the words of any language will have to be compressed into 1,000
CVC syllables. Now at this point you might deny this, but I will
show later why you have in fact done this. And then later I will show
how you have illegally added more and more things into the
model, and then, as if it means nothing, you have tried to back
up the result with a pitiful performance from real languages. Let's
continue.
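
For concreteness, here is the arithmetic of the quoted toy model as a short
Python sketch (the figures 14, 5 and the CVC shape are simply the ones the
quoted text assumes, nothing more):

    # Toy model from the quoted text: every lexeme is CVC, with 14 consonants
    # and 5 vowels assumed equiprobable.  The chance that a random CVC form
    # in B is phonetically identical to a given form in A is one over the
    # size of the CVC space.
    n_consonants = 14
    n_vowels = 5

    cvc_space = n_consonants * n_vowels * n_consonants   # 980 possible forms
    p_exact_match = 1 / cvc_space                         # ~0.00102, rounded to 1/1000 above

    print(cvc_space, p_exact_match)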

> How many chance resemblances are there? As a first approximation we
> might note that with a thousand chances at 1 in 1000 odds, there's a good
> chance of getting one match.

Why are we indulging in intuition again?

> However, this is not enough. How likely is it exactly that we get one
> match? What's the chance of two matches? Would three be quite
> surprising? Fortunately probability theory has solved this problem for us; the
> chance that you'll find r matches in n words where the probability of a single
> match is p is

>      (n! / (r! (n-r)!)) p^r (1 - p)^(n-r)

> or in this case

>      (1000! / (r! (1000-r)!)) .001^r .999^(1000-r)

This is the Binomial Distribution. It comes from independent Bernoulli
trials where the probability of success is p, and of failure is q=1-p.
Now you have chosen p=1/N, where N is the number of lexemes. Let's not
forget this. Let us also recall for the future that the mean (average) of
a Binomial is Np and the variance is Npq. Since q~1, Np~Npq.
In this case, the mean is 1 and the variance is essentially 1. This means
that you will get, on average, 1 match.

That means that if you checked thousands of languages against each other,
the average number of these accidental matches will be 1 per pair of
languages. Now, what exactly are you matching with what? How many words does
a language have, and how many meanings does a word have on average, averaged
across all the languages of the world? We will look into this soon.
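
As a quick numerical check of the above (a minimal sketch, standard library
only; nothing here depends on the model being realistic):

    # Binomial(n=1000, p=1/1000): mean n*p = 1, variance n*p*q close to 1.
    from math import comb

    def binom_pmf(r, n, p):
        """P(exactly r successes in n independent trials of probability p)."""
        return comb(n, r) * p**r * (1 - p)**(n - r)

    n, p = 1000, 1 / 1000
    print("mean     =", n * p)             # 1.0
    print("variance =", n * p * (1 - p))   # 0.999
    for r in range(7):
        print(f"p({r}) = {binom_pmf(r, n, p):.4f}")
    # p(0) and p(1) both come out near 0.368, p(2) near 0.184, and so on,
    # matching the figures quoted further down.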

Meanwhile there is the independence assumption. If a marksman were
taking potshots at a target with probability p of hitting it, then
you could use the Binomial for the number of hits, etc. If the probability
of either a boy or a girl being born is independent of what was born
previously, you can use the Binomial. If the probability of heads is
independent of what came up in the previous toss, you can use the Binomial.
But the probability of matching is not independent. Once you match one word
with another, you don't match it with a second, or a third, or a fourth.
The independence assumption is broken. But for large numbers it is
not that important. If, however, you were to try this with a small list,
then you would have to use another way to model this. This is related
to the Birthday Problem discussed in many books, and also discussed in
my book and in the papers that can be found on my website. I
suggest anyone interested in this problem of linguistics and human
prehistory read all of these. I also seriously ask all math majors
who are interested in these topics to get more involved in this
problem, and save humanity from medieval-age practice parading
around as science.
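
To make the independence point concrete, here is a minimal Monte Carlo
sketch. One simple way to model "matching without replacement" -- an
assumption made here purely for illustration -- is to count the fixed points
of a random permutation of N items and compare the chance of zero matches
with the Binomial value:

    # Matching each word of A against its same-meaning slot in B is like
    # counting fixed points of a random permutation of N items (once a word
    # is matched it cannot be matched again), not N independent trials.
    import random

    def p_zero_matches(n, trials=20_000):
        """Estimate P(no fixed points) for a random permutation of n items."""
        zero = 0
        for _ in range(trials):
            perm = list(range(n))
            random.shuffle(perm)
            if all(i != v for i, v in enumerate(perm)):
                zero += 1
        return zero / trials

    for n in (5, 1000):
        binom_p0 = (1 - 1 / n) ** n
        print(f"N={n}: permutation P(0) ~ {p_zero_matches(n):.3f}, "
              f"Binomial P(0) = {binom_p0:.3f}")
    # For N=5 the two differ visibly (about 0.367 vs 0.328); for N=1000 both
    # are essentially 1/e = 0.368, which is why the approximation is harmless
    # for long word lists but not for short ones.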

> For the first few r:
> p(1) = .368
> p(2) = .184
> p(3) = .0613
> p(4) = .0153
> p(5) = .00305
> p(6) = .000506

You forgot to calculate p(0), which is also about 0.368. That means
that about 73% of the time the number of matches will be
zero or one. Let us not forget this.

> So the probability of between 1 and 6 matches is .368 + .184 + .0613 +
> .0153 + .00305 + .000506 = .632, or about 63%. It would be improbable,
> in other words, if we found no exact matches in the entire dictionary. (But
> not very improbable; p(0), which we can find by subtracting the above
> p's from 1.0, is 37%.)

Now let us look at this slightly differently. For a small p and large N
the Poisson density is a very good approximation to the Binomial. What
then are we matching with what? This problem can be stated like this:
we create N containers (phonetic containers) which hold N balls (the
meaning of the word which is contained in the phonetic form which is
used to transport information in speech). We throw up the balls so
that each ball falls into one container. What is the probability that
each ball falls into its own container? That is the problem to try
to solve. And here another assumption of the model becomes clear.

Not only do the languages have exactly the same lexemes, but also the
same N meanings! This is very important. Let us recall again that a
whole language of maybe 400,000 words has been compressed into 1,000
CVC syllables. Presumably, these 1,000 CVC syllables will carry very
basic concepts left over from the very early days of language.
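
The Poisson approximation just mentioned can be checked in a few lines
(again only a sketch, standard library only):

    # For large N and small p, Binomial(N, p) is close to Poisson(N*p);
    # here N*p = 1, so the limiting distribution is Poisson(1).
    from math import comb, exp, factorial

    def binom_pmf(r, n, p):
        return comb(n, r) * p**r * (1 - p)**(n - r)

    def poisson_pmf(r, lam):
        return exp(-lam) * lam**r / factorial(r)

    n, p = 1000, 1 / 1000
    for r in range(5):
        print(f"r={r}: binomial {binom_pmf(r, n, p):.5f}   poisson {poisson_pmf(r, n*p):.5f}")
    # Both columns come out near 0.368, 0.368, 0.184, 0.061, 0.015.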

> Proffered resemblances are rarely exact. There is always some phonetic
> and semantic leeway. Either can be seen as increasing the set of words in B
> we would consider as a match to a given word a in A.

Well, let's see. You squeezed a typical language into 5 vowels and 14 consonants.
That means Arabic, with ~30 consonants, has been squeezed into 14. So each of
these model consonants covers about 2 Arabic consonants. Classical Arabic also
has only three vowels, /iua/. That means that they have to be spread out
over 5. So suppose we want to compare this to a language like Old Turkic,
which had 8 long vowels and 8 short vowels. Now, probably the most common
vowel system (for 5 vowels) is /iuaoe/. So let's denote these model
vowels as IUAOE, and let's make a match-up. (Note: the colon after a
vowel denotes a long vowel.)

I:  Arabic i  ---  Turkish: i, i:, I, I:
U:  Arabic u  ---  Turkish: U, u, U:, u:
A:  Arabic a  ---  Turkish: a, a:, o, o:
E:  Arabic i  ---  Turkish: e, e:, O, O:
O:  Arabic a  ---  Turkish: o, o:

These are the vowels of Old Turkish:

 a   a:   I   I:   o   o:   u   u:
 e   e:   i   i:   O   O:   U   U:

So u and U are not the same vowel. Now this is not unusual, since English
has about 20 vowels. So we now have to consider these as matches:

    Arabic i    Turkish i, i:, I, I:, e, e:, O, O:
    Arabic u    Turkish u, u:, U, U:
    Arabic a    Turkish a, a:, o, o:

This is just an example of the kinds of things that have to be done
to make this idea reasonable. So we have, on average, each Arabic
vowel matching about 5 Turkish vowels, as expected. But on the consonant
side, we expect every Turkish consonant to match about 1.5 Arabic
consonants. Now, if we were to consider a language like Ubykh with
about 80 consonants, things would change.

Let's be clear about what was just done. We have to do this just
to make the basic mathematical model work, because in a real language
there will be real vowels which do not match the vowels of the other
language, and we have to account for the disparity if we want to
use the prototype language with 14 Cs and 5 Vs.
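
A minimal sketch of this match-up as code (the particular Arabic/Turkish
assignments are simply the illustrative ones from the table above, not a
claim about the languages themselves):

    # Each vowel of the abstract 5-vowel model (I U A O E) is given the set
    # of real vowels it is allowed to stand for, following the table above.
    MODEL_VOWELS = {
        "I": {"arabic": {"i"}, "turkish": {"i", "i:", "I", "I:"}},
        "U": {"arabic": {"u"}, "turkish": {"U", "u", "U:", "u:"}},
        "A": {"arabic": {"a"}, "turkish": {"a", "a:", "o", "o:"}},
        "E": {"arabic": {"i"}, "turkish": {"e", "e:", "O", "O:"}},
        "O": {"arabic": {"a"}, "turkish": {"o", "o:"}},
    }

    def vowels_match(arabic_v, turkish_v):
        """True if some model vowel covers both the Arabic and Turkish vowel."""
        return any(arabic_v in slot["arabic"] and turkish_v in slot["turkish"]
                   for slot in MODEL_VOWELS.values())

    print(vowels_match("a", "o"))   # True: both sit under model A (and O)
    print(vowels_match("u", "e"))   # False under this mapping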

> For instance, suppose for each consonant we would accept a match with 3
> related consonants, and for each vowel, 3 related vowels. Since we're
> assuming a CVC root structure, this gives 3*3*3 = 27 words in B which
> might match any given a.

Ehhhh? Excuse me? What are you doing here? Are you now fumbling even
more with what we have already fumbled with? So when comparing
Arabic with Turkish, for Arabic a you would now allow it to
match Arabic i and Arabic u? This would mean that any vowel of Arabic
matches any vowel of Turkish. Is this supposed to be realistic?

Why bother with this long-winded confusion? You can do all your
approximations at once. Simply declare that your CVC syllables
will be built from 3 consonant classes and 1 vowel class:

P: p, t, k, b, d, g, kh, gh, h
F: f, th, x, s, sh, c, ch, v, z
L: l, m, n, ng, r
A: a, i, u, o, e, a:, ?, .....

Now isn't that a lot simpler? You can now have only 9 CVC syllables.
But that is too simple and not enough confusion. So add some more
Cs and Vs. But do not split up your approximations into several
stages and confuse yourself. At least when you are finished,
the results will be something others can relate to, and they
will know what to expect if they make approximations of that type.
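
As a sketch of that collapse (the class assignments below simply follow the
P/F/L/A lists above; anything not listed there is left out):

    # Map each phoneme to its coarse class; a CVC word then reduces to a
    # class skeleton, and the class-level CVC space has only 3*1*3 = 9 cells.
    CLASS_OF = {}
    for c in "p t k b d g kh gh h".split():
        CLASS_OF[c] = "P"
    for c in "f th x s sh c ch v z".split():
        CLASS_OF[c] = "F"
    for c in "l m n ng r".split():
        CLASS_OF[c] = "L"
    for v in "a i u o e a:".split():
        CLASS_OF[v] = "A"

    def skeleton(cvc):
        """Map a (C, V, C) triple of phonemes to its class skeleton."""
        return tuple(CLASS_OF[ph] for ph in cvc)

    print(skeleton(("k", "a", "sh")))   # ('P', 'A', 'F')
    print(skeleton(("g", "o", "z")))    # ('P', 'A', 'F') -- same cell, a "match"
    print("class-level CVC cells:", 3 * 1 * 3)   # 9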

Right now, you certainly are not confusing me.
You might be confusing yourself and other linguists. Above all, do
not forget the maxim of the great physicist, Richard Feynman:

"The first rule is, you must not fool yourself. And you are the
easiest person to fool."

> And suppose for each word a we will accept 10 possible meanings for b.
> This must be applied to each of the 27 phonetic matches; so a can now match a
> pool of 27*10 = 270 lexemes. The probability that it does so is of
> course 270 in 1000, or .27. Every lexeme in A, in other words, has a better
> than 1 in 4 chance of having a random match in B!

Meanings? What are we doing now? We have already abstracted about 1,000 meanings
from a real language of hundreds of thousands of words. Let's look at this
realistically. Say English has 400,000 words. (Turkish has 200 billion.)

(The Turkish figure comes from J. Hankamer. The 400,000 for English I can't
recall exactly; I picked it to create nice round numbers. Maybe the number is
1 million, maybe 200,000, depending on how one counts.)

Let's assume these are equally divided between nouns, verbs, adverbs and
adjectives, and that these are related to each other like run, quick, quickly,
etc., so we now have about 100,000 meanings. Then from these we squeeze every
100 related meanings into a single meaning. So if we have 100 different ways
to say "ingest food", we squeeze all of that into a single meaning "eat".

We have already done this. Don't forget, this prototypical language is
an averaged language. It has been squeezed, idealized, cleansed, purged,
shaped, formed and pre-averaged so we can get quick, easy, fast results.
That is what mathematical modeling is about.

So now, out of these 1,000 basic meanings, you want to combine every ten
meanings into one? Let's say one of the basic concepts/meanings in this
1,000-word list is "finger". What other 9 basic concepts do you want to
combine this with? How about cloud, sea, sleep, drink, have intercourse,
and so on? Surely this is not what you mean. What is it you are trying
to do?

I am sure you are aware of the basic idea behind comparative linguistics.
Its formalization was done by Morris Swadesh, who formed the so-called
Swadesh-100 and Swadesh-200 lists. These lists are concepts that would
be considered "basic terminology". These concepts are not new technical
terms, but consist of what "early humans" would have known, like animals,
food, clouds, eating, sleeping, family, weapons, rocks, water, etc.

How about starting with the Swadesh-200 list and adding some more
concepts which are as semantically far from those concepts as possible?
That would mean that after adding 800 of those you'd have the 1,000-item
list, and then you can try playing your games. Instead, why don't you
try scaling your system down to 200 words and then redoing some
calculations?
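
For instance, here is a sketch of what the same Binomial bookkeeping gives on
a 200-item list. The p values are just the ones that have come up in this
exchange (1/1000 for an exact CVC match, 0.027 with phonetic leeway only,
0.27 with phonetic plus semantic leeway); they are illustrations, not
measurements:

    from math import comb

    def binom_pmf(r, n, p):
        return comb(n, r) * p**r * (1 - p)**(n - r)

    n = 200
    for p in (0.001, 0.027, 0.27):
        mean = n * p
        p_zero = binom_pmf(0, n, p)
        print(f"p={p}: expected matches = {mean:.1f}, P(no match at all) = {p_zero:.3g}")
    # Roughly 0.2, 5.4 and 54 expected chance matches respectively on a
    # 200-item list, depending entirely on how much leeway is allowed.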

> How many chance resemblances are there now? The same formula can be
> used, with the revised estimate for p:

>      (1000! / (r! (1000-r)!)) .27^r .73^(1000-r)

> There is a significant probability for very high numbers of matches, so
> we must continue calculating for r well into the hundreds. The results can
> be summarized as follows:

Yes, this is a high number. But what have you calculated exactly? After
compressing a 100,000-word language into 1,000 meaning-lexemes, you
have now bunched them together yet one more time. Do people really take
a word like "finger" (a basic vocabulary item considered in almost all
works of this type) and then match it with throat, belly, feces, eat,
sleep, rotate, run, hunt, etc.? Can you give us examples of who has
done anything remotely resembling this?

Why don't you instead use CCVCC and CCCVCC syllables, etc., and make
it more realistic? With N=100,000 you can then use p=0.00001 and
do the calculations correctly for a realistic language. And if you
do have more problems, then you can read some more.
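
For anyone who wants to check the arithmetic, here is a sketch of the range
sums under MR's leeway-inflated parameters, together with the rerun suggested
above with N=100,000 and p=1/N. It is a sketch of the two parameterizations,
not an endorsement of either model:

    from math import comb

    def binom_pmf(r, n, p):
        return comb(n, r) * p**r * (1 - p)**(n - r)

    def prob_range(lo, hi, n, p):
        return sum(binom_pmf(r, n, p) for r in range(lo, hi + 1))

    # MR's leeway-inflated model: n = 1000, p = 0.27
    n, p = 1000, 0.27
    print("p(251..300) =", round(prob_range(251, 300, n, p), 3))   # about 0.90
    print("p(270)      =", round(binom_pmf(270, n, p), 4))         # about 0.028, the mode

    # The rerun suggested in this reply: N = 100,000 lexemes, p = 1/N
    n, p = 100_000, 1 / 100_000
    print("expected matches =", n * p)                    # 1.0
    print("p(0) =", round(binom_pmf(0, n, p), 3))         # about 0.368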

> p( 1 to 50) = 5.85 * 10^-74
> p( 51 to 100) = 1.22 * 10^-40
> p( 101 to 150) = 8.62 * 10^-20
> p( 151 to 200) = 1.70 * 10^-07
> p( 201 to 250) = .082
> p( 251 to 300) = .903
> p( 301 to 350) = .016
> p( 351 to 400) = 1.17 * 10^-08
> p( 401 to 450) = 2.11 * 10^-19
> p( 451 to 500) = 1.30 * 10^-34

> In other words there's a 90% chance that we'll find between 250 and 300
> matches, an 8% chance of finding less, and a 2% chance of finding more.

This is surely not serious. Was this an elaborate joke on linguists?

> Our rule of thumb would have suggested 270 matches, and this is in fact
> the number with the highest probability (2.84%).

Here is where you have definitely gone bonkers. I dare you to take a language
with 1,000 words and get 270 matches. I dare you to take a real language
like English with 100,000 words and find 270 accidental matches with
any language which is presumably not related. Actually, I will make it
worse. If you can get 270 matches out of 1,000 words, then for 100,000
words (average for a real language) you should find 27,000 matches.
That's what you owe us: 27,000 matches, because real languages do not
have 1,000 words; they have 100,000 words or more.

OK, if you start now and enlist the whole linguistics community, who might
be cheering your valiant efforts to defend Western civilization against
the barbarians, in a few hundred years you should be able to find
27,000 accidental matches, say between English and Dyirbal, or Hakka,
or Nilo-Saharan, or Mongolian, or Chinese. Why don't you start now and
find those 27,000 accidental matches?

> I will suggest refinements to this model below, but the basic features
> are in place: a probability for a single match; a calculation for number of
> expected matches; and an adjustment for phonetic and semantic leeway.

More refinements? I can't wait.

> Real phonologies
> We'd like to remove the unrealistic assumptions in this model, starting
> with the absurdly simplified phonologies. Fortunately this is not hard to do.

I am sure nothing is too hard for you. Maybe you have invented an
anti-gravity machine and are keeping it secret in your closet :-)

> it amounts to finding a better p -- an estimate for the chance of a
> random match which takes into account the actual phonologies of
> languages A and B.

Yep. How about p=1/N like before, with N=100,000 instead of 1,000,
to make it more like a real language?

> Suppose we want to check for random matches between Quechua and Chinese.

> First, we need to decide what constitutes a phonetic match between the
> two languages. One way of doing this is to decide for each Quechua phoneme
> what Chinese phonemes we'll accept as matches. (Think of it this way: is Qu.
> runa a match for Ch. rén? Is chinchi a match for chong? Is chay a match
> for zhè?)

> We might decide as follows. The criterion here is obviously phonetic
> similarity. We could certainly improve on this by requiring a particular
> phonological distance; e.g. a difference of no more than two phonetic
> features, such as voicing or place of articulation. The important point,
> as we will see, is to be clear about what we count or do not count as a
> match; or if we are evaluating someone else's work, to use the same phonetic
> criteria they do.

>  Qu. Ch.
>  p   p, b
>  t   t, d
>  ch  ch, zh, j, q, c, z
>  k   k, g
>  s   s, sh, c, z, x, zh
>  h   h
>  q   h, k
>  m   m, n
>  n   m, n, ng
>  ñ   m, n, ng, y
>  l   l, r
>  ll  l, r, y
>  r   l, r
>  w   w, u
>  y   y, i
>  a   a, e, o
>  i   i, e, y
>  u   u, o, w

Well, Qu. has 3 Vs and about 15 Cs, so now all you have to do is find a way
to count how many words it has. Now, you should only be counting CVC
syllables, nothing longer. And to be sure, you should first have some
double-blind way to select the 1,000 words (CVC syllables) in
question from Qu. and Ch. The way you are attempting it is cheating. Every
statistician, and even every psychologist who deals with stats, knows that.
What you have done is first find all the words that sound alike in Qu. and
Ch. using all of the words existing in these languages. What happened to the
1,000-word limit of basic vocabulary words?

> We will next need to know the frequency with which each phoneme occurs in
> each language. This can be calculated using a simple program operating on
> sample texts. For Quechua we find:

I can see now you are ready to duplicate the Ringe fallacy. You should know
that he has already changed his mind about the Binomial. Apparently someone
told him that he should use the Hypergeometric. Now is the time for you
to do your homework and figure out why. This time no help from me.
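
Meanwhile, for the record, the kind of "simple program operating on sample
texts" mentioned in the quote is only a few lines. This sketch assumes the
input has already been segmented into phonemes per word, which is the
genuinely language-specific part and is not handled here:

    # Count how often each phoneme occurs word-initially, word-medially and
    # word-finally, and report percentages per position like the quoted table.
    from collections import Counter

    def position_frequencies(words):
        """words: iterable of phoneme lists, e.g. [['ch', 'a', 'y'], ...]."""
        counts = {"initial": Counter(), "medial": Counter(), "final": Counter()}
        for phonemes in words:
            for i, ph in enumerate(phonemes):
                if i == 0:
                    pos = "initial"
                elif i == len(phonemes) - 1:
                    pos = "final"
                else:
                    pos = "medial"
                counts[pos][ph] += 1
        return {pos: {ph: 100 * c / sum(ctr.values()) for ph, c in ctr.items()}
                for pos, ctr in counts.items() if sum(ctr.values())}

    sample = [["r", "u", "n", "a"], ["ch", "i", "n", "ch", "i"], ["ch", "a", "y"]]
    print(position_frequencies(sample))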

>     initial    medial     final
>  a  5.291005   25.906736  40.211640
>  b  2.645503   0          0
> ...... [snip]............
>  ng 0          5.834186   12.016293
>  sh 7.800000   1.330604   0
>  zh 5.600000   1.740020   0

> (The reader who knows Chinese may wonder how we can have medial
> consonants at all. The answer is that I am using Chinese lexemes, not single
> characters (zì), so that, for instance, Zhongguó 'China' is one word,
> not two.)

No, actually I am wondering why you are doing all this when you obviously
do not keep up with the work being done in linguistics, copy models from
others only partway without understanding why they are the way they are,
and do not try to learn more about what is being done by others or to
learn more probability theory. I am not a probabilist but, guy, this
really takes the cake. If you are this desperate to become a hero
to poor linguists, at least take it off the web page and keep it private.
The least you can do is copy other linguists who create "Mutual Admiration
Societies" using the new technology called Internet Mailing Lists and
use the ideas from 1984 to create censored lists and call them moderated.
It is not true that "you can run but you can't hide". You can run and
you can hide, but not on the WWW and not on USENET newsgroups. If you want
to hide, go do it on some censored mailing list.

...... [snip]...................

> Was all that worth it?

It certainly was not. I am sure you will agree with me.

>...

> Second, seemingly minor points of procedure have a huge impact on our
> results. We are used to situations where rough calculations do not lead

Yes, great observations.

> us far astray. But in this area differing assumptions or methodologies lead
> to very different results. Very careful attention to both is warranted.

Yep. Do not forget that you are the easiest person to fool, and you do
not want to fool yourself.

> To analyze a claim about language relationship based simply on
> resemblances (as opposed, of course, to one based on the comparative method), we can
> apply the principles and formulas developed above.

See above. Only a linguist could prefer a rule of thumb based on nothing
more than gut feelings to mathematical calculation and science. And read
my papers on the subject. If you have questions and indeed want to learn
more, instead of pontificating in the belief that you are saving world
civilization, you are welcome to join the "Language" list. Send email to
majordomo at csam.montclair.edu.


Sincerely,

M. Hubey
hubeyh at mail.montclair.edu
http://www.csam.montclair.edu/~hubey
 

