uc and ucak

H.M.Hubey hubeyh at montclair.edu
Fri Oct 30 11:51:44 UTC 1998


----------------------------Original message----------------------------
Isidore Dyen wrote:
>
> ----------------------------Original message----------------------------
> The point that you make is quite right, but I believe that what you are
> dealing with is likelihood instead of mathematical probability. What lies
> at the bottom of the problem is that the lay word probability most
> commonly has the meaning 'likelihood' that is the relation between the
> respective probabilities associated with each outcome. ID.
 
Actually the biggest problem with probability calculations involving
linguistics (especially historical linguistics) is not about technical
matters but with setting up the problems correctly.
 
1. First, the simplest thing to do is to use uniform or equal
probabilities when other info is lacking.
 
2. Second, the simplest thing to do is to assume things like
independence, uncorrelated sequences, etc.
 
3. Third, some people, who are allegedly experts in this field, trip
over simple things. Notice, for example, the Ringe and Manaster-Ramer
exchanges.
 
However, the main problem, as in all probability problems, is to be
able to clearly know/define the sample space. The sample space is not
only those languages that are now in existence, nor only those
languages that were in existence, but all languages that could have
possibly existed but did not. It is this last part that is the biggest
problem. Also a part of this problem is the tendency to extrapolate
linearly from simple results without any justification for it. For
example, someone might run a simple simulation with 2 languages, each
having 5 phonemes and 100 words, get 15 matches by accident, and then
extrapolate that extending this to 1,000 words with 50 phonemes etc.
will probably produce 500 matches. Or someone might start with
calculations using the binomial distribution with p=0.001 (which might
be appropriate for 1,000 words) and then use a semantic shift of 25
(because you can find 25 words having to do with "eating" in English)
while ignoring that these 25 words came from the full English language
with 100,000 to 400,000 words.
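
A minimal sketch of why the linear extrapolation fails, under
assumptions that are not in the message itself: every word is a string
of word_len phonemes drawn uniformly and independently, and a "match"
means the two languages have identical strings for the same meaning.
The function name expected_matches and the word length of 2 are purely
illustrative.

```python
def expected_matches(n_words: int, n_phonemes: int, word_len: int) -> float:
    """Expected number of chance matches between two word lists.

    Assumes (hypothetically) uniform, independent phonemes, so two words
    for the same meaning coincide with probability n_phonemes ** -word_len.
    """
    p_match = (1.0 / n_phonemes) ** word_len
    return n_words * p_match


small = expected_matches(n_words=100, n_phonemes=5, word_len=2)
large = expected_matches(n_words=1000, n_phonemes=50, word_len=2)
naive = small * (1000 / 100)  # scaling by word count alone

print(small, large, naive)
```

Under these toy assumptions the small case gives about 4 expected
matches and scaling by word count alone gives about 40, while the
model itself gives about 0.4 for the larger case, because the match
probability per word drops as the phoneme inventory grows.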
 
The other big problem is to assume that word distributions are all
totally uncorrelated and independent when we know that they are best
modeled as correlated and as Markov processes. Even when simple models
are used we should not use independence.
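
To illustrate how much the independence assumption can distort things,
here is a small sketch with a hypothetical two-state Markov chain. The
states "a" and "b" and the transition probabilities are invented for
the example and do not come from any real language data.

```python
# Hypothetical two-state chain over word classes "a" and "b".
# transition[x][y] is the probability that y follows x.
transition = {
    "a": {"a": 0.9, "b": 0.1},
    "b": {"a": 0.3, "b": 0.7},
}

# Stationary distribution, from pi_a * P(a->b) = pi_b * P(b->a).
p_ab, p_ba = transition["a"]["b"], transition["b"]["a"]
stationary = {"a": p_ba / (p_ab + p_ba), "b": p_ab / (p_ab + p_ba)}


def prob_markov(seq):
    """Probability of seq when each item depends on the previous one."""
    p = stationary[seq[0]]
    for prev, cur in zip(seq, seq[1:]):
        p *= transition[prev][cur]
    return p


def prob_independent(seq):
    """Probability of seq if items are (wrongly) treated as independent."""
    p = 1.0
    for item in seq:
        p *= stationary[item]
    return p


run = "bbbb"
print(prob_markov(run), prob_independent(run))
```

With these made-up numbers the run "bbbb" has probability of roughly
0.086 under the chain but only about 0.004 under independence, a
difference of more than a factor of 20.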
 
Despite all this, there are even more elementary problems.
 
For example, if you toss a penny 5 times and the first 4 come up heads
(H), it does not mean that, since getting 5 heads is so rare, the penny
somehow owes you tails on the last toss. It does not. It has no memory.
The prob is 1/2 at each throw.
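
The penny example can be checked by brute force. The sketch below
assumes a fair coin and simply enumerates all 32 equally likely
sequences of five tosses; nothing is sampled.

```python
from fractions import Fraction
from itertools import product

# Enumerate all 2**5 equally likely outcomes of five fair tosses.
outcomes = list(product("HT", repeat=5))

# Probability of five heads, judged before any toss is made.
p_five_heads = Fraction(outcomes.count(("H",) * 5), len(outcomes))

# Keep only the sequences whose first four tosses are all heads.
first_four_heads = [seq for seq in outcomes if seq[:4] == ("H",) * 4]

# Among those, count how often the fifth toss is also heads.
fifth_is_head = [seq for seq in first_four_heads if seq[4] == "H"]
p_head_given_four = Fraction(len(fifth_is_head), len(first_four_heads))

print(p_five_heads, p_head_given_four)  # 1/32 1/2
```

Five heads in a row is rare (1/32) when judged in advance, but once
four heads are on the table the fifth toss is still an even bet.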
 
As another example, think about bus schedules. Buses arrive at dense
intervals during morning and evening. Suppose the avg time between bus
arrivals is M, and a passenger pops up at a bus stop at random. What is
his expected wait? Surely, we know it is longer than M/2, which is what
evenly spaced buses would give, and with arrivals bunched enough it is
longer than M itself.
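
A short sketch of the bus calculation, assuming the passenger arrives
uniformly at random in time over one cycle of the schedule. The two
schedules are hypothetical; both have the same average gap M = 15
minutes.

```python
def mean_gap(gaps):
    """Average time between consecutive buses (the M of the text)."""
    return sum(gaps) / len(gaps)


def expected_wait(gaps):
    """Expected wait for a passenger arriving uniformly at random in time.

    A gap of length g is hit with probability g / sum(gaps), and the mean
    wait inside it is g / 2, which gives sum(g**2) / (2 * sum(gaps)).
    """
    return sum(g * g for g in gaps) / (2 * sum(gaps))


even = [15, 15, 15, 15]   # hypothetical evenly spaced schedule, minutes
bunched = [2, 2, 2, 54]   # same mean gap, buses bunched together

print(mean_gap(even), expected_wait(even))        # 15.0 7.5
print(mean_gap(bunched), expected_wait(bunched))  # 15.0 24.4
```

The passenger is more likely to land in a long gap than a short one,
so uneven spacing pushes the expected wait above M/2, and in the
bunched schedule above M itself.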
 
As another example, just because someone is guaranteed to win a lottery
of 1 million numbers does not mean that the guy who won is not lucky.
His chance of winning was still 0.000001.
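
The two probabilities in the lottery example can be written down
directly. This sketch assumes one ticket per number and exactly one
winning number, which is a simplification of any real lottery.

```python
from fractions import Fraction

n_tickets = 1_000_000  # one ticket per number, exactly one winning number

# Chance that one particular, named ticket wins.
p_named_ticket_wins = Fraction(1, n_tickets)

# Chance that some ticket wins: the outcomes are disjoint, so they add.
p_some_ticket_wins = n_tickets * p_named_ticket_wins

print(float(p_named_ticket_wins), p_some_ticket_wins)  # 1e-06 1
```

Both statements are true at once: somebody is certain to win, and the
person who did win beat odds of one in a million.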
 
Related to this is the fact that you are bound to find gold someplace
in the ground because there is so much of it. That does not mean that,
when you find a gold nugget digging some place, there is no more there
because it was just an accident. No. There is probably more there,
because this is not a purely random event. Where there is some, there
is probably more.
 
The same holds for words. Every word you find has to be thought of as
being improbable because, you see, the biggest part of the problem is
missing. I don't know any Sanskrit. I just read maybe 20-40 words in
some book by accident and one of them hit!
 
If I watch a movie with some Eskimos and all of a sudden, 3 words hit
me suspiciously just like some other words, anyone who tries to convince
me that it happens all the time better spend some time learning to
use probability theory to solve some real problems in the real world.
 
Next time you watch the natives of X-land on TV, listen to their
words carefully and see if you can spot any English! If you do, chances
are great that they will be speaking English. That is a Markov
process.
 
PS. The total sample space is not 6,000 languages. It is maybe 1,000,000
languages that might have existed but did not (or disappeared).
 
PPS. Just because 80% of the world's population is white does not
mean that the fact that they resemble each other is not genetic. Of
course it is. Physical characteristics are genetic.
 
The most difficult part of using any kind of math is in knowing
when to use which formula. Those who botch things in the 8th grade
or earlier say "I hate those word problems". Some people are like
Duracells :-) They don't hate word problems, not in the 8th grade,
not 12th, not in the MS level and not in the PhD level, and not
even 20 years after the PhD level :-)
 
 
 
> On Wed, 28 Oct 1998, Larry Trask wrote:
>
> > ----------------------------Original message----------------------------
> > On Wed, 28 Oct 1998, H.M.Hubey wrote:
> >
> > > I am going to ask something that sounds very strange.
> >
> > > I noticed accidentally that a word like 'uchakka' (that is
> > > reasonably close) meant 'to fly' in Sanskrit. (I hope that was
> > > Sanskrit, and not Indic or Hindi).
> >
> > > In any case, it is very strange for me to see this since
> > > 'uch' means 'to fly' in Turkic. The choices are
> >
> > > 1. Accident
> > > 2. There is something we are missing
> > >         2.a) Uch was borrowed into Turkic
> > >         2.b) "uch" is protoworld
> > >         2.c) The root comes from the Ancient ME (ANE)
> >
> > > There might be more but these are good enough for a start.
> >
> > > 1) with odds of 1 to 100 or 1 to 1000, it behooves not to
> > > believe this at first.
> >
> > No, not so, I'm afraid.
> >
> > If you ask the question "what is the probability that a form <uch> will
> > mean `fly' in both Turkish and Sanskrit?", then the *a priori* odds
> > against are simply enormous.  But that's the wrong question.
> >
> > The right question is this: what is the probability that *some* short
> > form will turn up in *some* two languages with similar meanings in both?
> > And this time the answer is "as close to 100% as you care to get".
> >
> > With 6000+ languages available, all of them with thousands of words, and
> > with only a small number of distinguishable consonants and vowels
> > available to construct those words, it is statistically inconceivable
> > that we should fail to find many, many chance resemblances or identities
> > like this one.  Hence stumbling across one is evidence for nothing at
> > all, except that the laws of probability are not taking the day off.
> >
> > Failure to appreciate this constitutes what somebody has dubbed
> > "Koestler's fallacy", since the writer Arthur Koestler was notoriously
> > prone to it.  Koestler was constantly impressed by the observation that
> > *some particular* coincidence had occurred, reasoning that the *a
> > priori* odds against that particular coincidence were astronomical, and
> > hence concluding that Something Deeply Significant was going on.  What
> > he always failed to realize is that the probability that *some
> > coincidence or other* would occur purely by chance was effectively 100%.
> >
> >
> > Larry Trask
> > COGS
> > University of Sussex
> > Brighton BN1 9QH
> > UK
> >
> > larryt at cogs.susx.ac.uk
> >
 
--
Best Regards,
Mark
-==-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
hubeyh at montclair.edu =-=-=-= http://www.csam.montclair.edu/~hubey
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=



More information about the Histling mailing list