[language] Kessler Review

Tue Feb 11 16:03:10 UTC 2003

<><><><><><><><><><><><>--This is the Language List--<><><><><><><><><><><><><>

BOOK REVIEW

by H.M. Hubey, Department of Computer Science, Montclair State
University, New Jersey

DESCRIPTION OF THE BOOK

Kessler, Brett. 2001. The Significance of Word Lists. CSLI Publications,
x+277pp, hardback ISBN 1-57586-299-9, paperback ISBN 1-57586-300-6,
Dissertations in Linguistics. Announced at
http://linguistlist.org/issues/12/12-790.html#1

The major issues addressed in the book are (i) the concept of distance
and similarity, (ii) the comparative method, (iii) statistical tests,
specifically the chi-square test, and (iv) data cleaning, so that the
chi-square test gives good results. A quick summary is in order: (i) the
book is excellent, (ii) contrary to expectation it is not about
statistics but rather linguistics, and (iii) its significance lies in
its comprehensive use of the methods of probability theory instead of
the simple, patched-together methods used previously. In fact, the
book's jacket explains the problem most succinctly:

    The most strident controversies in historical linguistics debate
    whether claims for historical connections between languages are
    erroneously based on chance similarities between word lists. But
    even though it is the province of statistical mathematics to judge
    whether evidence is significant or due to chance, neither side in
    these debates uses statistics, leaving readers little room to
    adjudicate competing claims objectively. This book fills that gap
    by presenting a new statistical methodology that helps linguists
    decide whether short word lists have more recurrent sound
    correspondences than can be expected by chance. The author shows
    that many of the complicating rules of thumb linguists invoke to
    obviate chance resemblances, such as multilateral comparison or
    emphasizing grammar over vocabulary, actually decrease the power
    of quantitative tests. But while the statistical methodology
    itself is straightforward, the author also details the extensive
    linguistic work needed to produce word lists that do not yield
    nonsensical results.

The first problem Kessler tackles is the usual confusion between
"resemblance" and "cognate", "proof" and "statistics", and "distance"
and "similarity". It is not unusual even for "alleged" quantitative
linguists to get these concepts backwards, and then use Freudian
projection as a defense.

First, the most important concept. Suppose we attempt to ascertain the
abstract property which this compound word represents: hotness:coldness.
From what we know, we can see that these measure the same thing, but the
scales run in opposite directions. In this case, the property we
measure has an unequivocal name: temperature.

Suppose we attempt the same with nearness:farness. It is easy to see
that the property being measured is distance. However, this word is not
abstract enough. We can also measure "time" with the same compound word.
Or we may use long_ago:recent, or even distantness:recentness. In
perceptual space (not physical space or temporal space) the common word
in use is "similarity". Thus distance:similarity measures a concept
called "distance". The reason for this seeming contradiction is that
natural languages often have words with two meanings. We might correct
it via similarity:dissimilarity, which then measures "distance". It is
this dual usage of "distance" that often confuses people, especially in
conjunction with the word "similarity" or "dissimilarity". In other
words, we have distance1:similarity, and this concept we measure via
distance2.

Having said this, we can say that comparative methods are attempts to
measure how far from chance the observed data is, nothing more, nothing
less. This much is made crystal clear by Kessler, who obviously
understands historical linguistics methodology better than some
linguists despite being a psychologist. But surely not being a linguist
should not be held against him. It is not unusual for new methods to be
brought into a field by outsiders. It happens in physics, engineering,
computer science, genetics, biology, economics. Why not linguistics?

Once this is clear, then it becomes clearer why binary comparison can be
put on the same footing as multi-way comparisons. After all, whatever
the data represents, all we want to know is "what is the probability
that this data occurred purely due to chance?" We obviously want this
number as small as possible if we want to conclude that the data
represent an event that is not due to chance.

Kessler also makes it clear that if we obtain a very small probability
that the data arose by chance, all we can conclude is exactly that: the
data represent a state of events that is not due to chance. How it came
about depends on other assumptions.

Kessler is very clear on the fact that the Swadesh list is nothing more
than a formalization of concepts that historical linguists developed
over centuries. What was needed was a list that would be useful for
making tests, one that avoided technological borrowings, onomatopoeic
words, and other "unreliable" words, and such lists were created by
Swadesh. Any linguist who has anything against the lists is in
disagreement not only with Swadesh but also with the basic postulates of
historical linguistics.

In summary, what we want is a test or a number that tells us how far
from a chance distribution the data are. There is such a test: the
chi-square test. But there is a catch; the data must be independent.
That means that if the words for finger and foot come from the same
root, they are not independent, and the chi-square test will give
incorrect results, since it is based on the independence hypothesis.
Kessler gives a short example of how it works. In truth he probably
should have given a whole chapter or two on the mathematics of the
chi-square test; however, he probably decided that it can be found in
any statistics book. Instead he concentrates on selection procedures for
the words. At the end of the book there is the Swadesh list for the
languages which Ringe used for his early attempt at the use of
statistics. Kessler shows that many of these words are borrowings from
other languages. In any case, any test will give incorrect results if
the inputs are incorrect. In computer science this is called GIGO:
Garbage In, Garbage Out. There is never any substitute, not yet, for
human intelligence. However, there is now at least one way of comparing
the closeness of languages to each other using some number. That is
about the closest concept to "distance" that historical linguistics has
ever reached. It would have been better if he had developed it further
by normalizing it. For example, let d(x,y) be the "dissimilarity"
between languages x and y, and let s(x,y) be the "similarity" between
languages x and y. Then by normalizing these quantities to the interval
[0,1] we can easily see that s(x,y) = 1 - d(x,y).
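As a rough illustration of the normalization suggested here (this is
not something taken from the book), the following Python sketch scales
a set of raw pairwise dissimilarity scores into [0,1] by dividing by
the largest one, and then derives s(x,y) = 1 - d(x,y). The language
names, the raw scores, and the function names are all invented for the
example, and dividing by the maximum is only one of several possible
ways to normalize.

```python
def normalize_dissimilarities(raw):
    """Scale non-negative raw dissimilarities into the interval [0, 1].

    `raw` maps a pair of language names to a raw score, with 0 meaning
    "identical". Dividing by the largest score is an assumption made
    for this sketch, not a method described in the book.
    """
    largest = max(raw.values())
    if largest == 0:
        raise ValueError("all languages are identical; nothing to scale")
    return {pair: value / largest for pair, value in raw.items()}


def similarities(d):
    """Turn normalized dissimilarities d(x,y) into s(x,y) = 1 - d(x,y)."""
    return {pair: 1.0 - value for pair, value in d.items()}


# Invented raw scores for three hypothetical languages A, B and C.
raw_scores = {
    ("A", "A"): 0.0,
    ("A", "B"): 2.0,
    ("B", "C"): 4.0,
    ("A", "C"): 8.0,
}

d = normalize_dissimilarities(raw_scores)
s = similarities(d)
print(d[("A", "B")], s[("A", "B")])  # prints 0.25 0.75
```

With this scaling a language is at distance 0 from itself, and the most
dissimilar pair in the sample is at distance 1.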

Obviously, d(x,y) should be normalized so that d(x,x)=0. It can be seen
already that if z and w are two "most-distant" languages then d(z,w)=1.
That is something that still needs more work.

As already mentioned, most of the book is spent on the actual results of
the comparisons amongst the languages. Kessler makes various changes,
e.g. Swadesh 100 vs Swadesh 200, using only the first phoneme vs using
more phonemes, etc. He also discusses thoroughly the problems with using
more than a single phoneme, or even a single phoneme. The problem is
that we do not know which phoneme should be cognate with which phoneme.
In other words, what if one language has lost its initial consonants?
Then we would be attempting to match a consonant to a vowel, which is
certain to produce bad results. In the language of data mining, this is
called data cleaning or pre-processing, and it is an important part of
analysis. Kessler discusses such problems thoroughly and clearly.

The final result is that historical linguistics is on its way to
becoming a rigorous science like those that preceded it. Kessler
probably could have spent more time (and space) explaining the concept
of hypothesis testing, false positives, false negatives, etc. even if
only in an appendix.
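Since the mathematics of the test is left to the statistics books, a
small sketch may help readers who have not met it. The Python code
below computes the Pearson chi-square statistic for a table of paired
categories, for instance the initial consonant of each word in language
1 against the initial consonant of the word for the same concept in
language 2. The word pairs are invented toy data, the function name is
hypothetical, and this is only the textbook statistic, not Kessler's
own procedure.

```python
from collections import Counter


def chi_square_independence(pairs):
    """Return (statistic, degrees of freedom) for a list of (a, b) pairs.

    Each pair is one independent observation, e.g. the initial
    consonants of the two words for one concept. The statistic compares
    the observed count of every (a, b) combination with the count
    expected if a and b were unrelated.
    """
    n = len(pairs)
    row_totals = Counter(a for a, _ in pairs)
    col_totals = Counter(b for _, b in pairs)
    observed = Counter(pairs)  # missing combinations count as 0

    statistic = 0.0
    for a, row_n in row_totals.items():
        for b, col_n in col_totals.items():
            expected = row_n * col_n / n  # count expected under independence
            statistic += (observed[(a, b)] - expected) ** 2 / expected

    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof


# Invented toy data: 20 concepts with a perfectly recurrent correspondence.
recurrent = [("p", "f")] * 10 + [("t", "th")] * 10
statistic, dof = chi_square_independence(recurrent)
print(statistic, dof)  # prints 20.0 1
```

With one degree of freedom the usual 5% critical value is 3.841, so a
statistic of 20 would be judged very unlikely to arise by chance, while
a table with no association at all gives a statistic of 0. The result
is only meaningful if the observations really are independent, which is
exactly the point made above about words that share a root.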

Kessler discusses in other chapters how to go about making use of
consonants other than the first one in the comparanda. The main problem
is the one faced by researchers in speech recognition and genetics: the
phonemes have to be "aligned". That is, it is possible that one of the
languages could have lost the initial consonant, or could have gone
through a metathesis, etc. Therefore some algorithms are needed to
obtain the optimum alignment automatically. These require a measure of
phonetic/phonemic distance, but such measures are much easier to
construct than measures of semantic distance (and already exist in
various forms, even if only implicitly). There is much more to the book
that should be of interest to historical linguists.
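To show what "alignment" means in practice, here is a minimal Python
sketch of the standard dynamic-programming approach used for sequence
alignment in genetics, applied to two phoneme strings. It is not taken
from the book. It assumes unit costs (every mismatch and every gap
costs 1), whereas a serious application would plug in a phonetic
distance, and it does not handle metathesis. The two word forms are
invented.

```python
def align(a, b, gap="-"):
    """Globally align two phoneme sequences with unit costs.

    Returns the two aligned rows and the total cost. Unit mismatch and
    gap costs are an assumption made for this sketch.
    """
    n, m = len(a), len(b)
    # cost[i][j] is the cheapest way to align a[:i] with b[:j].
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = cost[i - 1][j - 1] + (a[i - 1] != b[j - 1])
            cost[i][j] = min(match, cost[i - 1][j] + 1, cost[i][j - 1] + 1)

    # Walk back from the corner to recover one optimal alignment.
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0
                and cost[i][j] == cost[i - 1][j - 1] + (a[i - 1] != b[j - 1])):
            row_a.append(a[i - 1])
            row_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            row_a.append(a[i - 1])  # phoneme in a with no partner in b
            row_b.append(gap)
            i -= 1
        else:
            row_a.append(gap)  # phoneme in b with no partner in a
            row_b.append(b[j - 1])
            j -= 1
    return row_a[::-1], row_b[::-1], cost[n][m]


# Invented forms: the second language has lost the initial consonant.
top, bottom, total = align(list("hapa"), list("apa"))
print(" ".join(top))     # prints h a p a
print(" ".join(bottom))  # prints - a p a
```

The gap in the second row pairs the lost initial consonant with nothing,
so the vowels and the remaining consonant are compared with their true
counterparts instead of being shifted by one position.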

In summary, the book is excellent but might require some work for those
linguists who have math-anxiety or have any kind of aversion to
quantitative techniques. However, beginning statistics courses are now
taught at universities at the general-education level, and there is no
excuse for anyone not to at least have some grasp of the fundamentals of
statistics and probability theory. Time never goes backwards.

--
M. Hubey
-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o
The only difference between humans and machines is that humans
can be created by unskilled labor. Arthur C. Clarke

/\/\/\/\//\/\/\/\/\/\/ http://www.csam.montclair.edu/~hubey