[Q] Universals, Statistics

Bill Croft croft at CASBS.STANFORD.EDU
Fri Sep 26 17:43:54 UTC 2003

Dear Eddy,

      Your query about statistical techniques in typology is useful, and
I think points out some fundamental differences between the
typological and generative method. You wrote:

>The reason I seem to feel this way is that, if those patterns
>are obtained in this manner, and then stated as generalizations over
>language types, one proceeds in a post-hoc manner. The correct
>procedure, I would guess, would be to first hypothesize that
>a certain pattern must exist, and then attempt to disconfirm this
>hypothesis on the basis of a data set. After reading those articles
>(but perhaps this is where my mistake lies) I was left with the
>feeling that this is not the way people proceed.

      You're right, this is not the way typologists proceed. The method
used in typological research is inductive: examine a sample of
languages (ideally, a sample designed to be reasonably representative
of the total universe of languages), and inductively derive
universals of language based on what is observed in the sample. The
method you prefer is deductive: propose a hypothesis and then
(dis)confirm the hypothesis on the basis of a dataset. The deductive
method is the one commonly used, or at least advocated, in the
generative tradition.

      The methods may appear to be opposed, but this would be a
simplistic characterization of the difference. Typologists, after all,
cannot look at every grammatical feature at once, and a typological
study examines a range of grammatical features that the researcher
thinks are likely to exhibit a significant correlation deserving of a
causal explanation. The typologist's choice is obviously a hypothesis
developed before looking at all the languages in the sample (though
often a literature survey or a small pilot sample is used to get on
the right track: looking at a set of features taken out of the air
for a 250-language sample would be a very risky investment of
research effort). Conversely, no generativist that I know formulates
a hypothesis without looking at any data in any language.

      Your final comment illustrates another difference between the
typological and generative methods:

>      Likewise, given the number of possible patterns in a data set, some
>statistically unlikely patterns will occur, even if the data set
>were completely random and there existed no underlying laws or
>tendencies governing human language variation.

First, this is not entirely true of the population of actual human
languages: there are many patterns that do not occur. A favorite
example of mine is a language with a suffix on nouns where the number
of referents is an integer solution of the equation x^2 + 3x + 2 and
a different suffix where the number of referents is not an integer
solution to this equation. I would bet that this is not a possible
natural human language, and of course an explanatory theory of
grammatical number should give a principled reason for excluding such
a language type. However, a typologist would also want an explanatory
theory to account for the probability distribution of attested
language types. This is where typology differs from generative
grammar (as I understand it, but my knowledge may be outdated), since
generative grammar has (traditionally) been concerned exclusively
with the question of what is a *possible* human language. Typologists
also consider it a matter for grammatical theory to account for what
is a more vs less probable human language type, among the types that
actually occur.
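
      The quoted worry about chance patterns can be made concrete with
a small simulation. What follows is only an illustrative sketch, not a
method taken from any of the work mentioned here. It assumes a
hypothetical sample of 250 languages, 20 invented binary features
assigned by independent coin flips, and a plain Pearson chi-square test
on each pair of features at the 0.05 level (critical value 3.841 for
one degree of freedom). All names and parameters are made up for the
illustration.

```python
import itertools
import random

# Chi-square critical value for 1 degree of freedom at alpha = 0.05.
CRITICAL_CHI2_DF1_P05 = 3.841


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        return 0.0  # degenerate table: no association can be measured
    return n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)


def count_chance_correlations(n_languages=250, n_features=20, seed=0):
    """Count feature pairs that look 'significant' in purely random data.

    Every feature value is an independent coin flip, so by construction
    there is no underlying law relating any two features.
    """
    rng = random.Random(seed)
    sample = [
        [rng.random() < 0.5 for _ in range(n_features)]
        for _ in range(n_languages)
    ]
    pairs = list(itertools.combinations(range(n_features), 2))
    significant = 0
    for i, j in pairs:
        # Cross-tabulate feature i against feature j over the sample.
        a = sum(1 for lang in sample if lang[i] and lang[j])
        b = sum(1 for lang in sample if lang[i] and not lang[j])
        c = sum(1 for lang in sample if not lang[i] and lang[j])
        d = sum(1 for lang in sample if not lang[i] and not lang[j])
        if chi_square_2x2(a, b, c, d) > CRITICAL_CHI2_DF1_P05:
            significant += 1
    return significant, len(pairs)


if __name__ == "__main__":
    found, tested = count_chance_correlations()
    print(f"{found} of {tested} random feature pairs exceed the 0.05 threshold")
```

With 20 features there are 190 pairs, so at the 0.05 level one would
expect on the order of 5% of them to cross the threshold by chance
alone. The exact count depends on the seed. This is why a correlation
found by scanning a sample needs either a prior hypothesis or some
correction for the number of comparisons made.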

       Having said all this, I cannot sincerely conclude without giving
one reason why I use the typological method rather than a more
deductive method (there are others). If one constructs a hypothetical
language universal before looking at a representative sample of
languages (even a relatively small one), for instance by examining
one's own native language or a few well-known languages (such as some
large western European standard languages), then the universal is
much more likely to be invalid than if one constructs the hypothesis
on the basis of data from a representative sample. A useful
discussion of this issue is found in Gary Gilligan's PhD dissertation
on the pro-drop parameter (USC, 1987); it is unfortunately
unpublished, but there is a brief summary in the second edition of my
"Typology and Universals" (CUP, 2003), pp. 80-84.

