STATISTICS IN LINGUISTICS

Sat Mar 13 22:11:39 UTC 1999

   From: "Patrick C. Ryan" <proto-language at email.msn.com>
   Date: Sat, 27 Feb 1999 00:12:13 -0600

   Thank you for a lucid explanation of probabilities for all of us.

   I agree with everything you have written with the exception that
   the situation in historical linguistics is so problematic.

   But, why do you not explain in detail why you think it is --- not
   based on a priori assumptions but on analysis of data?

I'll tell you what would be needed for a probabilistic test of
language relatedness to be valid.

First, you must decide on a protocol to follow which will provide an
objective measure of the similarity between two languages. The
important thing is that this measurement must not depend on the
researcher's knowledge of the languages --- on the contrary, it should
be repeatable with consistent results by different people. NOTE: this
measure of similarity is your experimental result --- any conclusions
about relatedness would only follow after statistical analysis.

One possible protocol might be to simply hand out dictionaries to a
few undergraduates who have never even heard the names of the
languages before, and let them find as many similarities as they can.
Your experimental result is the median number of entries in their
lists.
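
As a minimal sketch of how that number would be computed (the function
name and the input shape are my own assumptions, not part of any
established protocol): each participant hands in a list of proposed
similarities, and only the lengths of the lists are used.

```python
from statistics import median

def experimental_result(candidate_lists):
    """Median number of entries across the participants' lists.

    `candidate_lists` is a hypothetical input: one list of proposed
    similarities per participant. The content of the entries is not
    inspected here, only how many there are.
    """
    return median(len(entries) for entries in candidate_lists)
```

Using the median rather than the mean keeps one overenthusiastic
participant from dominating the result.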

Another would be to let an expert in each language supply the best
word for each node in one of the `semantic networks' that are being
promoted, expressing it in IPA. The experts must work separately,
neither knowing the identity of the other language. You could then let
a computer program compare the lists, using a fixed algorithm to look
for inexact semantic and phonological matches, and awarding points for
each match according to its exactness.
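
The comparison step could look roughly like the sketch below. It is an
illustration under loud assumptions: the names `match_points` and
`similarity_score`, the 0.5 cutoff, and the use of plain string
similarity as a stand-in for phonological similarity are all mine. It
also only compares words filed under the same node, so it does not
attempt the inexact semantic matching mentioned above.

```python
from difflib import SequenceMatcher

def match_points(form_a, form_b, min_ratio=0.5):
    """Points for one node: the string-similarity ratio of the two
    transcriptions, or 0 if they are less similar than `min_ratio`.

    Assumption: character overlap is a crude proxy for phonological
    similarity; a real protocol would compare sound classes instead.
    """
    ratio = SequenceMatcher(None, form_a, form_b).ratio()
    return ratio if ratio >= min_ratio else 0.0

def similarity_score(list_a, list_b):
    """Total points over the nodes that both experts filled in.

    Each list is a dict mapping a node label to a transcription.
    The algorithm is fixed: same inputs, same score, whoever runs it.
    """
    shared_nodes = sorted(set(list_a) & set(list_b))
    return sum(match_points(list_a[n], list_b[n]) for n in shared_nodes)
```

The point of the exercise is not this particular scoring rule but that
the rule is written down before anybody looks at the data.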

Next, you must apply this experimental method to a large number of
language pairs where there is already general agreement about their
degree of relatedness (by descent and borrowing). Large probably means
a few hundred to a thousand. Draw up charts of your experimental score
against the known degree of relatedness, and see if something
statistically significant emerges.
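
One way to check for "something statistically significant" is a rank
correlation with a permutation test, sketched below. The choice of
Spearman's coefficient, the function names, and the numeric coding of
"known degree of relatedness" are assumptions on my part; other tests
would do as well.

```python
import random

def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result

def pearson(xs, ys):
    """Plain Pearson correlation; assumes neither input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def spearman(scores, relatedness):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(scores), ranks(relatedness))

def permutation_p_value(scores, relatedness, trials=2000, seed=0):
    """Share of random re-pairings whose correlation is at least as
    strong as the observed one (two-sided)."""
    rng = random.Random(seed)
    observed = abs(spearman(scores, relatedness))
    shuffled = list(relatedness)
    hits = 0
    for _ in range(trials):
        rng.shuffle(shuffled)
        if abs(spearman(scores, shuffled)) >= observed:
            hits += 1
    return (hits + 1) / (trials + 1)
```

Here `scores` would hold one experimental score per language pair and
`relatedness` the agreed degree of relatedness for the same pair,
coded as a number in whatever way the protocol fixes in advance.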

But even if you find a significant correlation, it's quite possible
that it is not strong enough to predict anything. For that you need a
result like '90% of language pairs related at a depth of less than 500
years scored more than 82 points', which will allow you to assert that
a new pair of languages that scores less than 82 points is probably
not related at a depth of less than 500 years.
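
Checking a statement of that form against the calibration data is
simple arithmetic, as in the sketch below. The function name and the
input are hypothetical, and the 82 points in the text are an
illustrative figure, not a measured one. Note that the sketch only
checks the "90% scored more than 82" half; how confidently a low
score rules out a close relationship also depends on how the other
calibration pairs scored.

```python
def fraction_above(scores, cutoff):
    """Fraction of the given scores that are strictly above `cutoff`.

    `scores` is assumed to hold the experimental scores of all
    calibration pairs in one relatedness class, for example those
    related at a depth of less than 500 years.
    """
    if not scores:
        raise ValueError("no calibration scores in this class")
    return sum(score > cutoff for score in scores) / len(scores)
```

With real calibration data, `fraction_above(scores, 82) >= 0.9` would
be the test of the example statement.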

For your purposes, a last problem remains. There _is_ no agreement on
language relatedness at the time depths of Nostratic, much less
Proto-World, so you would have to extrapolate your data. Even if a
trend could be identified by proper statistical analysis,
extrapolation will lessen the credibility of the final results.

Once you have done this work, you can run Igbo and Inuit through your
measurement process and see if the number you get tells you anything
besides 'not discernibly related'. And if it does, you can claim to
have statistical evidence of their relationship.

This is something like the standard social scientists have to live up
to if they do not want their results to be dismissed out of hand.

But, to borrow a phrase, you must surely agree that this is not the
way historical linguistics is done today, by you or anybody else, and
therefore any attempt to use statistics to defend your hypotheses is
just so much hot air.

Lars Mathiesen (U of Copenhagen CS Dep) <thorinn at diku.dk> (Humour NOT marked)