Clarifications regarding "UPenn Tree"

Mon Sep 20 23:53:59 UTC 1999

I'd like to clarify some issues regarding the tree produced by Ringe,
Warnow, and Taylor.

Following is a portion of the input to an earlier run of the program; I've
edited it a little for clarity, and it's not the most recent version, but
the essential form has not changed:

Char. 1  2  3  4  5  6 ...
__________________________
Hi    2  1  1  2  1  3 ...
Ar    1  2  4  1  3  4 ...
Gr    1  2  1  1  2  1 ...
Al    3  3  3  5  3  4 ...
TB    4  4  2  3  1  1 ...
Ve    2  1  2  6  1  2 ...
Av    2  1  2  7  1  2 ...
OCS   2  5  2  8  4  5 ...
Li    5  6  4  9  5  6 ...
OE    6  7  5  10 6  2 ...
OI    7  8  2  11 7  1 ...
La    2  9  2  2  2  8 ...

The meanings of the characters and their values are:

1. augment
    1. present  2 &c. absent
2. thematized aorist
    1. absent   2. present or immediately reconstructable
    3. aorist lost, or unclear
3. productive function of *-sk'e'/o'-
    1. iterative   2. inchoative  3.  causative  4 &c. other, or lost
4. function of *dhi'
    1. imperative (only)  2.  past (with imperative relics)
    3 &c. lost or unclear
5. mediopassive primary marker (sg., 3pl.)
    1. *-r   2.  *-y (= */-i/)   3 &c. lost
6. thematic optative
    1. *-oy-   2. *-a:-   3 &c. absent, or preform obscure

The algorithm takes this coded input and determines which of the
34,459,425 possible trees is the best fit for the IE family.
The team actually used a rather larger set of characters, but this subset
illustrates what the table of characters looks like.  I trust that the
abbreviations for the languages representing each branch are readily
recognizable.
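
As a rough check on the size of that search space, here is a minimal
Python sketch (an illustration only, not part of the team's program) of
the standard count of unrooted, strictly binary trees on n labeled leaves,
(2n-5)!!.  The function name is made up for this example.  Under this
counting convention, 34,459,425 is the value for n = 11; the twelve-row
excerpt above would give a larger number, so the quoted figure does not
correspond to exactly this excerpt.

```python
def count_unrooted_binary_trees(n_leaves: int) -> int:
    """Count unrooted, strictly binary trees on n labeled leaves: (2n-5)!!."""
    if n_leaves < 3:
        # With fewer than three leaves there is only one possible shape.
        return 1
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n-5.
    for odd in range(3, 2 * n_leaves - 4, 2):
        count *= odd
    return count


print(count_unrooted_binary_trees(11))  # 34459425
print(count_unrooted_binary_trees(12))  # 654729075
```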

One thing I'd point out is that if several branches have all lost
something which existed in PIE, each of these languages is assigned a
_unique_ value for the corresponding character.  If one were to code them
all with the _same_ number (meaning 'lost' or 'not present'), this would
tend to group all of those languages together.  But we know that a loss of
a morphological category, etc., is certainly something which can readily
happen as a parallel innovation, so each language which underwent such a
loss is given a unique value to prevent them from being incorrectly
grouped.
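
The effect of this coding choice can be seen in a small Python sketch.  It
uses the values for character 4 exactly as they appear in the excerpt
above; the helper name shared_states and the "collapsed" recoding are
hypothetical, introduced only to show the contrast, and are not how the
team's program works internally.

```python
from collections import defaultdict

# Character 4 (function of *dhi'), copied from the excerpt of the table.
CHAR_4 = {
    "Hi": 2, "Ar": 1, "Gr": 1, "Al": 5, "TB": 3, "Ve": 6,
    "Av": 7, "OCS": 8, "Li": 9, "OE": 10, "OI": 11, "La": 2,
}


def shared_states(column):
    """Return {state: [languages]} for states carried by two or more languages."""
    by_state = defaultdict(list)
    for language, state in column.items():
        by_state[state].append(language)
    return {s: langs for s, langs in by_state.items() if len(langs) > 1}


# Coding as in the table: each loss has its own value, so losses group nothing.
unique_coding = shared_states(CHAR_4)

# Hypothetical alternative: collapse every value >= 3 into one "lost" state.
collapsed = shared_states({lang: min(state, 3) for lang, state in CHAR_4.items()})

print(unique_coding)      # only states 1 and 2 are shared
print(len(collapsed[3]))  # 8 languages now share the single "lost" state
```

With the unique coding, only the two retained states (1 and 2) put
languages together.  With the collapsed coding, eight languages share one
value, which a tree-fitting method could mistake for a common innovation.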

This is _all_ that the input to the program looks like.  To give the
complete input would merely be a matter of extending the chart further to
the right for the rest of the characters (I'm just too lazy to type it all
here). There have been some rather fantastic guesses in recent posts about
what kind of input the program takes and what kind of computation it can
do.  I hope it's clearer now that the program takes a table like the one
above as its input and gives a tree like the one I gave as its output.
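
To make the shape of that input concrete, here is a minimal Python sketch
that reads the excerpt into a language-by-character matrix.  The actual
file format read by the team's program is not given here, so the layout,
the function name parse_character_table, and the dictionary representation
are all assumptions made for illustration.  The underline row and the
trailing "..." are omitted from the embedded text.

```python
TABLE = """\
Char. 1  2  3  4  5  6
Hi    2  1  1  2  1  3
Ar    1  2  4  1  3  4
Gr    1  2  1  1  2  1
Al    3  3  3  5  3  4
TB    4  4  2  3  1  1
Ve    2  1  2  6  1  2
Av    2  1  2  7  1  2
OCS   2  5  2  8  4  5
Li    5  6  4  9  5  6
OE    6  7  5  10 6  2
OI    7  8  2  11 7  1
La    2  9  2  2  2  8
"""


def parse_character_table(text):
    """Parse a whitespace-separated table into {language: {character: state}}."""
    lines = [line.split() for line in text.strip().splitlines()]
    header, rows = lines[0], lines[1:]
    char_ids = [int(c) for c in header[1:]]  # skip the "Char." label
    matrix = {}
    for row in rows:
        language, values = row[0], [int(v) for v in row[1:]]
        if len(values) != len(char_ids):
            raise ValueError(f"wrong number of values for {language}")
        matrix[language] = dict(zip(char_ids, values))
    return matrix


matrix = parse_character_table(TABLE)
print(len(matrix))      # 12 languages
print(matrix["OE"][4])  # 10
```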

In an earlier post, I said that "reconstructed forms were not included in
the data."  I see now that I should have clarified what I meant by this,
because there has been some misunderstanding.  What I meant is that there
is no row in the table for PIE as there is for Hittite, Old English, etc.
Including such a row with characters coded for PIE might be one way of
trying to determine the root node (remember that the algorithm produces an
unrooted tree); but this isn't what the team chose to do.  That's what I
meant when I said that reconstructions weren't used.

I definitely did _not_ mean that there wasn't any reconstruction involved
in any way at all.  The team made standard assumptions about what
reconstructed PIE looks like, and it was this standard set of assumptions
which permitted them to create the table which served as input to the
program.  For example, the optative *-oy- turns out in different ways in
different languages depending on the language-specific sound changes; but
it is by performing comparative reconstruction that one can recognize that
these superficially different reflexes in fact all derive from the same
PIE suffix.  The appropriate languages are accordingly all assigned
the same character value.

I hope this has cleared up how the tree was produced and what the
algorithm in question is intended to do and is able to do.

  \/ __ __    _\_     --Sean Crist  (kurisuto at unagi.cis.upenn.edu)
 ---  |  |    \ /     http://www.ling.upenn.edu/~kurisuto/
  _| ,| ,|   -----
  _| ,| ,|    [_]
   |  |  |    [_]