background noise

Dr. John E. McLaughlin and Michelle R. Sutton mclasutt at brigham.net
Fri Mar 26 12:22:51 UTC 1999


[ moderator re-formatted ]

ECOLING at aol.com wrote:

> In a message dated 3/24/99 7:58:55 PM, John McLaughlin wrote:

>>I did some random language generation and comparison on computer based on
>>known phonological inventories and frequencies.  The results were published
>>in the most recent Mid-America Linguistics Conference Proceedings and are
>>also available on Pat Ryan's web site (where he graciously notes that I don't
>>disagree with nearly all of his findings)--although without the tables yet.
>>Computer-controlled comparison revealed that the closer two phonologies were
>>to one another the higher the frequency of random lookalikes and the smaller
>>the phonological inventories the higher the frequency of random lookalikes.

> Good.
> (The last of these seems to imply that human judgements are influenced by
> superficial similarities, or that some mechanical formula was done in such a
> way that it is influenced by such superficial similarities.  It is not clear
> from the quotation above whether the "comparison" part was by human or
> machine.)

The comparison was based on a table of correspondences that I constructed.  The
computer then slavishly matched according to this table.  I also did one run in
which the computer accepted only exact matches.  The table of correspondences
is located with the paper at Pat Ryan's web site, so you can see what I was
comparing.

http://www.geocities.com/Athens/Forum/2803/1998_MALC_PAPER.htm
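Very roughly, that table-driven matching can be sketched as follows (a minimal illustration in Python; the tiny correspondence table and the segment-by-segment alignment are invented for the sketch and are not the paper's actual table):

```python
# Hypothetical correspondence table: a phoneme in language A maps to the
# set of phonemes in language B that count as "matching" it.  Phonemes
# not listed match only themselves.  (Invented for the sketch.)
correspondences = {"p": {"p", "b"}, "t": {"t", "d"}, "a": {"a"}, "i": {"i", "e"}}

def words_match(word_a, word_b, table=None):
    """True if the words match segment by segment.

    With table=None only exact matches count (the 'exact match' run);
    otherwise each segment pair must appear in the table."""
    if len(word_a) != len(word_b):
        return False
    if table is None:
        return word_a == word_b
    return all(b in table.get(a, {a}) for a, b in zip(word_a, word_b))

def count_lookalikes(vocab_a, vocab_b, table=None):
    """Count gloss-by-gloss lookalikes between two random vocabularies."""
    return sum(words_match(a, b, table) for a, b in zip(vocab_a, vocab_b))

print(count_lookalikes(["pat", "tip"], ["bad", "tip"], correspondences))
```

A looser table (larger sets on the right-hand side) counts more pairs as lookalikes, which is one way a "more refined formula" of the kind discussed below could be swapped in and re-run on the same random data.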

In the case of the "similar phonologies", I'll explain what happened.  I used
the phonological inventories of eight languages and determined the frequency of
each phoneme by doing some counting in dictionaries.  I used the smallest and
the largest phonologies twice each, which gave me ten sets of rules that the
computer used to construct random vocabulary.  So when I said "similar
phonologies", the phonological patterns were in fact identical.
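In outline, that generation step might look like this (a sketch only; the inventory, the frequencies, and the simple CV syllable shape are invented stand-ins for the real dictionary counts and language-specific rules):

```python
import random

# Hypothetical phoneme inventory with relative frequencies
# (the real runs used frequencies counted from dictionaries).
consonants = {"p": 10, "t": 15, "k": 12, "s": 8, "m": 9, "n": 11}
vowels = {"a": 20, "i": 15, "u": 10}

def weighted_choice(inventory, rng):
    """Pick one phoneme, weighted by its relative frequency."""
    phonemes = list(inventory)
    weights = [inventory[p] for p in phonemes]
    return rng.choices(phonemes, weights=weights, k=1)[0]

def random_word(rng, syllables=2):
    """Build a word out of simple CV syllables."""
    return "".join(
        weighted_choice(consonants, rng) + weighted_choice(vowels, rng)
        for _ in range(syllables)
    )

def random_vocabulary(size, seed=0):
    """Generate `size` random words from one set of phonological rules."""
    rng = random.Random(seed)
    return [random_word(rng) for _ in range(size)]

print(random_vocabulary(5))
```

Two vocabularies built from the same rule set are the "identical phonologies" case; building them from two different inventories gives the dissimilar case.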

> It would be important that the data of any such studies of randomness,
> and the exact computations used, should be in public view at all times
> and electronically available, so anyone with a more refined formula
> for counting things as "alike", or better, degree-of-similarity, can
> see the result of such definitional changes on the computations.
> That way we can refine our ideas of what we can look for in
> seeking similarities which are less likely to be due to chance.

I'm working on a revised version of my original simple program that will do a
variety of different kinds of comparison.  For the published version, what
kinds of "data and...exact computations" would you like to see in the paper?
(NOTE:  I'm not a mathematician or computer theorist, so please don't use words
that are highly specialized in meaning.)

> And as I have urged a number of times, this should ALSO be tested
> against cases where languages are known to be related, to see whether
> degree of relationship can be estimated (of course given different
> degrees of intensity of change, it cannot be exactly a measure of time,
> but on average it could be.)

Because of the semantic assumptions that I program into the computer-generated
languages, this is exceptionally difficult to do.  Additionally, the results
from natural languages would not really match the computer-generated results,
because a complete pattern of correspondences may not be possible between any
two natural languages, let alone among five related languages.  In addition, we
are always stuck with holes in the data.  However, I've developed a method to
simulate relatedness in the program.  When constructing the random language
data, the computer (when I tell it to) makes languages X and Y related:
depending on the time depth I tell it to simulate, it takes a certain
percentage of forms in L1 and copies them directly into L2, with sound change
taken into account.  The same procedure is used as L3 is constructed.
Different percentages are applied in a formulaic way to simulate different
distances from L1 (a sort of lexicostatistic method of subgrouping).
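That simulation could be sketched like this (a minimal illustration under my own assumptions; the retention parameter, the context-free sound-change rules, and the helper names are invented for the example, not the program's actual design):

```python
import random

def apply_sound_change(word, rules):
    """Apply simple context-free sound changes (e.g. p > b) to a form."""
    for old, new in rules.items():
        word = word.replace(old, new)
    return word

def derive_related(vocab_l1, retention, sound_changes, make_word, seed=0):
    """Build L2 from L1 by copying a `retention` fraction (0..1) of
    forms through the sound changes and replacing the rest with fresh
    random words.  A smaller retention simulates a greater time depth."""
    rng = random.Random(seed)
    vocab_l2 = []
    for word in vocab_l1:
        if rng.random() < retention:
            # Cognate: inherited from L1, with sound change applied.
            vocab_l2.append(apply_sound_change(word, sound_changes))
        else:
            # Non-cognate: a brand-new random form.
            vocab_l2.append(make_word(rng))
    return vocab_l2
```

Calling this repeatedly with a formulaically decreasing retention for L3, L4, and so on would yield the graded distances from L1 described above.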

> I am not naive enough to think that a mechanistic approach can substitute
> for good historical linguistics and philology.

Nor am I.  When one looks at Sir William Jones's original assumptions about
Indo-European, there was far more there than simple lexical comparison, as Bob
Rankin reminded me when I read the paper at the Mid-America Linguistics
Conference.  Language families that have been widely accepted as proven also
show rule-governed morphological, syntactic, and semantic similarities on a
large scale.  The computer program simply gives us a feel for how close lexical
similarity should be before we get excited enough to do the other comparisons.

Thanks for the input
John McLaughlin


