Principled Comparative Method - a new tool
Jon Patrick
jonpat at staff.cs.usyd.edu.au
Thu Aug 12 04:27:07 UTC 1999
Lloyd Anderson said
The real task of extending the Comparative Method to deeper time
depths is to make explicit more of these sophisticated tools,
and to CREATE more such tools by discovering what ways of
handling the data are robust across what kinds of intervening changes.
I would like to present to the list a description of a tool I developed with
a doctoral student to measure the phonological relatedness of languages. It
seems to address, in part, some of the issues raised by LA. We applied it to
measuring the distance between modern Beijing and modern Cantonese. We
initially wanted to use it on the Basque dialects but couldn't get sufficient
data.
The idea is that the distance between languages is represented by the series
of changes that occur to a large set of words in moving from their parent form
to their daughter forms, so that distance apart is not measured between the
daughter languages but rather by their distance from their parent. We feel
this better represents the real world process.
Our data has to be the word set in the parent form (reconstructed words or
real words), then one word set for each daughter language, and the set of
phonological transformation rules between each parent and daughter for each
word, in their chronological sequence. Hence we are modelling the rules and
their sequence of application for each word. The extent to which any of this
information is hypothetical merely defines the hypotheses one is comparing,
but importantly it does not affect the computational method we apply to this
data.
The computational method is in part new and in part old. The old part is that
for the sequence of phonological rules for the first word of a parent-daughter
pair we construct a finite state automaton. For the second and subsequent
rule sequences we overlay the rule sequence on the original automaton, creating
new transitions where needed for rules as yet unrepresented in the automaton.
By counting the transitions along each pathway as we build up the automaton we
are creating frequency counts of rules in their sequence of application. Such
an automaton is a Probabilistic Finite State Automaton (PFSA). Once all words
are placed on the automaton it is the canonical PFSA that describes the total
diachronic rule set and structure for that parent-daughter pair and nothing
else. This description then captures the characteristics of the total data set
(of this class of data).
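The overlay-and-count construction described above can be sketched roughly as
follows. This is only an illustration, not our actual program: the function
name, the rule labels and the data layout are all hypothetical, and the sketch
leaves out any end-of-word marker.

```python
from collections import defaultdict

def build_canonical_pfsa(rule_sequences):
    """Overlay rule sequences on one automaton, counting each transition.

    Returns (transitions, n_states), where
    transitions[state][rule] = [next_state, count] and state 0 is the start.
    """
    transitions = defaultdict(dict)
    next_id = 1  # state 0 already exists as the start state
    for seq in rule_sequences:
        state = 0
        for rule in seq:
            if rule not in transitions[state]:
                # rule not yet represented at this point: add a new arc
                transitions[state][rule] = [next_id, 0]
                next_id += 1
            arc = transitions[state][rule]
            arc[1] += 1      # frequency count for this arc
            state = arc[0]   # follow the arc
    return dict(transitions), next_id

# Hypothetical rule sequences for three words (labels are made up).
words = [
    ["devoice", "final_stop_loss"],
    ["devoice", "tone_split"],
    ["palatalise"],
]
pfsa, n_states = build_canonical_pfsa(words)
for state, arcs in sorted(pfsa.items()):
    print(state, arcs)
```

Words that share an initial run of rules share the same path, so the count on
each arc is the number of words whose derivation passes through it.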
The newish part of the method is to apply the principle of minimum message
length (MML) encoding to calculate the cost of the message that describes this
PFSA. This is an information theory principle which of itself dates back to
the 1940s, but our development of it for PFSAs is new. If we have the cost of
the messages for two parent-daughter pairs then the shorter message represents
the daughter that is closer to the parent. In the case of modern Cantonese and
Beijing we got 35243.58 bits and 36790.93 bits respectively, indicating
Cantonese is closer to the common parent, Middle Chinese, than Beijing. The
difference between these two numbers can be viewed as an approximate odds
ratio 1:2^diff (that is meant to be "two to the power of the difference").
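To give a feel for the size of that odds ratio, here is a small worked
calculation using the two figures just quoted. The function name is
hypothetical. The power of two is far too large for an ordinary floating point
number, so the sketch reports its base-10 logarithm instead.

```python
import math

def odds_exponent(bits_a, bits_b):
    """Return (difference in bits, log10 of the odds 2**difference)."""
    diff = abs(bits_a - bits_b)
    return diff, diff * math.log10(2)

# Canonical PFSA message lengths quoted above (Cantonese, Beijing).
diff, log10_odds = odds_exponent(35243.58, 36790.93)
print(f"difference = {diff:.2f} bits, odds about 1 : 10^{log10_odds:.1f}")
```

The difference is 1547.35 bits, so on this reading the odds are overwhelmingly
in favour of the shorter message.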
However, a further analysis can be performed. The canonical PFSA can be
reduced, by merging states, to some form that yields a minimised MML. Such a
minimised PFSA strikes a balance between the number of states and the
frequencies of the transitions out of each state. Essentially it merges
together paths through the PFSA that have relatively similar rules and
frequencies of rules, and also places very rare transitions where their cost
is lessened, e.g. transitioning them back to their originating state. A
minimum PFSA does exist although you can't guarantee that you can find it.
Note that no data is thrown away. Everything is always kept in the PFSA.
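The balance between the number of states and the transition frequencies can be
illustrated with a deliberately simplified cost function. This is NOT the MML
formula we actually use for PFSAs. It is an assumed stand-in: each state pays
for an adaptive code over its outgoing arc counts, plus the cost of naming the
rule and destination of each arc. All names below are hypothetical.

```python
import math

def log2_factorial(n):
    """log2(n!) via the log-gamma function."""
    return math.lgamma(n + 1) / math.log(2)

def state_cost_bits(arc_counts, n_symbols, n_states):
    """Simplified cost in bits of one state (an assumption, see lead-in).

    arc_counts: traversal counts of the arcs leaving the state.
    """
    k = len(arc_counts)
    if k == 0:
        return 0.0  # a state with no outgoing arcs costs nothing here
    t = sum(arc_counts)
    # adaptive code length for t traversals spread over k arcs
    data = (log2_factorial(t + k - 1) - log2_factorial(k - 1)
            - sum(log2_factorial(n) for n in arc_counts))
    # naming each arc's rule (one of n_symbols) and target (one of n_states)
    structure = k * (math.log2(n_symbols) + math.log2(n_states))
    return data + structure

def pfsa_cost_bits(arc_counts_by_state, n_symbols):
    """Sum the simplified cost over every state of the automaton."""
    n_states = len(arc_counts_by_state)
    return sum(state_cost_bits(counts, n_symbols, n_states)
               for counts in arc_counts_by_state.values())

# Toy automaton: state -> counts on its outgoing arcs (made-up numbers).
toy = {0: [2, 1], 1: [1, 1], 2: [], 3: [], 4: []}
print(f"{pfsa_cost_bits(toy, n_symbols=4):.2f} bits")
```

A minimisation search would compare this total before and after a candidate
merge of two states and keep the merge only if the total goes down.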
The minimised PFSAs give the following results for Cantonese and Beijing:
30379.01 bits and 30366.55 bits respectively. In each language pairing the
number of states is reduced by about 80% and the number of arcs by 50%.
The most interesting part of this result is the reversal of the results as to
which is closer to its parent: in the first case Cantonese, and in the
second Beijing. This distinction is more pronounced when allophonic features
are also included in the analysis. One appraisal of these results is that the
generalisation process (= PFSA minimisation) has discerned more structure in
Beijing than Cantonese. The analysis of the generalised automata revealed
hitherto unsuspected relationships between diachronic rules.
Our method should be useful to appraise competing reconstructions of earlier
languages, say Indo-European; however, to date we have not been able to find
the necessary data compiled in one place to easily apply it. Should anyone
have a good database of appropriate data we would be happy to submit it to our
methods.
Jon Patrick
More information about the Indo-european
mailing list