the Chinese study

Fri Sep 10 11:02:08 UTC 1999

[ moderator re-formatted ]

    I wrote
    <<In *PIE, certain aspects are considered the innovations of a particular
      daughter language because they do not appear in the other daughter
      languages, and are therefore factored out of the reconstruction.  If you
      only have two daughter languages - as you did above - how do you identify
      the innovation versus the original form in reconstruction?>>

    jonpat at staff.cs.usyd.edu.au wrote:
    <<If I understand "innovation" correctly it has to be represented by a rule
     of insertion from a null position. That's not a problem; it's just another
     rule at a particular point in the Relative Chronology. The algorithm will
     process it correctly.>>

    This would appear to be a moot point from what you said above.  But just to
    be clear:  I'm not sure where your algorithm starts, but my point was
    simple.

    We find a state B(F1) and C(F1).  Languages B and C differ by the use of
    say one phoneme alone.  Otherwise they are identical and coeval.  We assume
    and reconstruct a parent A.  The lone phoneme difference between B and C
    creates an unknown: whether the phoneme in B is from the parent or whether
    the phoneme in C is from the parent.  If we conclude B is identical to the
    parent, then C carries the "innovation."  (Forget about dual innovations
    for now.)

    Based on the above there is no statistical certainty at all in choosing B
    over C or vice versa.  It is not the "insertion from the null position"
    that is the issue I think you will see here, but in fact how that insertion
    decision affects the reconstruction.  Reconstructions should work backward
    in time.  So if "insertion" = "innovation", it presumes in fact that the
    "inserted" data was not in the parent.  But in fact we are in complete
    uncertainty about that fact.  (But again you are not reconstructing.)

OK, I think I understand your question and the problem statement.
I would frame your question a different way. Firstly you have the issue of
what is the optimal reconstruction for a given child. So you construct
multiple putative parents for a child and use our method to choose the
reconstructed parent that best fits the data. You repeat this process for the
sister language and so arrive at two reconstructed parents, one for each
daughter. Then you determine the distance of each daughter from the other's
parent. Whichever parent gives the lowest cumulative cost would be the
preferred parent. I will speculate (but get back to you later on the matter)
that the message lengths for describing the pair of daughter languages under
each putative parent are directly additive, because you are merely describing
the cost of one followed by the cost of the other. (FOOTNOTE: there could be
some coding strategies that might be usable to compress the message lengths,
for example merging the two PFSAs of the daughters for each of the putative
parents.)
Does this answer your question? Not directly. I think the answer is in your
own words: there is no discriminatory information in the data as described
that a coding strategy could exploit to choose between the solutions.
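
To make the tie concrete, here is a toy sketch in Python. It is not our
method: plain edit distance stands in for message length (no PFSAs are built),
the costs are simply assumed to be additive as speculated above, and the word
lists and names (B, C, total_cost) are invented for illustration only.

```python
def edit_distance(a, b):
    """Levenshtein distance; only a stand-in for a real message length."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete from a
                            curr[j - 1] + 1,            # insert into a
                            prev[j - 1] + (ca != cb)))  # substitute or match
        prev = curr
    return prev[-1]


def total_cost(parent, daughters):
    """Assumed additive: cost of one daughter followed by cost of the other."""
    return sum(edit_distance(p, w)
               for daughter in daughters
               for p, w in zip(parent, daughter))


# Hypothetical daughters that differ in a single phoneme (made-up data).
B = ["pater", "mater"]
C = ["fater", "mater"]

# Two putative parents: one identical to B, one identical to C.
candidates = {"parent_like_B": B, "parent_like_C": C}
costs = {name: total_cost(words, [B, C]) for name, words in candidates.items()}
print(costs)
```

Both candidates come out with the same total cost, so under this stand-in
measure nothing in the data favours one parent over the other.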

    Since you also said that your approach can only compare two
    reconstructions, this may not be a problem for you, although you will not
    be able to reduce the uncertainty in the example above no matter how many
    reconstructions you test, because two alternative reconstructions will not
    necessarily make one of the choices better than the other.

My off-the-cuff answer is that I agree with you. Remember, however, that our
method relies on a reasonably sized data sample. So if the innovation is rarely
used in the sample, it will contribute little to discriminating the models.
Single occurrences of rules contribute little to discriminating between
competing parents (no pun intended).

    It may seem trivial in terms of the work you are doing.  But this
    fundamental uncertainty in any reconstructive process can yield very
    different results in subsequent analysis using those reconstructions as a
    basis.

I cannot judge the strength of this comment, as my historical linguistic
knowledge is limited, so I shall accept it on your word.

Jon Patrick