Basque statistics

Thu Aug 12 15:08:24 UTC 1999

We have far too little study of the biases and
distortions imposed by our tools of analysis,
valuable though those tools are.

Thanks to Larry Trask for a clear statement of ways of purifying
vocabulary lists, in attempting to identify native ancient Basque
vocabulary.

If we do not believe that any list is TOTALLY "pure",
then what concerns us is degrees of "purity":
which lists are relatively more pure and which less.

Rather than applying particular cutoff criteria,
it can be more sophisticated (though also much more work)
to study the DIFFERENCE between vocabulary sets selected
by a particular criterion and vocabulary sets not limited
by that criterion.

While early attestation is obviously a valuable cutoff (3)
if the aim is to achieve the absolutely purest vocabulary set possible,
it runs the risk of excluding much authentic native Basque
vocabulary which simply happened not to be recorded "early".
Can we instead have several "degrees of earliness", or distinctions
as to WHICH sources attest a word?  Trask does some of the latter,
suggesting that we exclude some sources which he believes are
particularly unreliable.

>   (3) It is attested early.
>
>   [Let's say before 1600, which is `early' for Basque.]
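
As a rough illustration of "degrees of earliness" in place of a single
cutoff, here is a minimal sketch. Only the 1600 boundary comes from the
quoted criterion; the later boundaries (1750, 1900) and the function name
`earliness_tier` are my own hypothetical choices, to be replaced by
whatever divisions the sources actually justify.

```python
def earliness_tier(first_attested, boundaries=(1600, 1750, 1900)):
    """Return a graded earliness label instead of a yes/no cutoff.

    Tier 0 means attested before the first boundary (the 'early' class),
    higher tiers are progressively later, and None means the date of
    first attestation is unknown.  Only 1600 is taken from the quoted
    criterion; the other boundaries are hypothetical placeholders.
    """
    if first_attested is None:
        return None
    for tier, limit in enumerate(boundaries):
        if first_attested < limit:
            return tier
    return len(boundaries)
```

A word first recorded in 1599 and one first recorded in 1601 then fall
into adjacent tiers rather than on opposite sides of an absolute line,
and the tier can be stored with the word and consulted later.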

***

The following (4) is also perfectly reasonable,
if intended merely to achieve the "purest" possible vocabulary set,
but it also can exclude legitimate native ancient Basque vocabulary,
especially if, for example, some of that vocabulary has been borrowed
FROM Basque INTO another language or languages.

>   (4) There is no reason to believe that it is shared with
>   languages known to have been in contact with Basque.

>   [Subjective, and hard to formalize, but I believe that doubtful
>   cases are few enough to constitute only a minor problem.]

Nursery words may be among the most persistent in many cultures,
so I do not see a reason to flatly exclude these.  Label them as
nursery words, perhaps (though the category is actually much broader
than the two examples)...

***

>   (5) It does not appear to be a nursery word.
>...
>Now, (5) would exclude only a very few words not excluded by the other
>criteria, notably <ama> `mother' and <aita> `father',

And even sound-imitative words may sometimes be of use, since
they can undergo regular sound changes (as with laughter
"ha-ha" becoming Russian "xoxotat'" with an a>o shift).
Or, more borderline, "teeny" becoming "tiny" in the English
Great Vowel Shift, then regenerated (or borrowed from other dialects?)
as "teeny" again.  So again, I suggest not absolutely excluding them,
but studying what difference it makes to the patterning if they
are included vs. excluded.

>   (6) It does not appear to be of imitative origin.
>while (6) would
>exclude a much larger number of items which would be automatically
>excluded in any serious comparative work, like <miau> `meow', <mu>
>`moo', <be> `baa', <din-dan> `ding-dong', <tu> `spit', and probably also
><usin> `sneeze'.  This last sounds roughly like oo-SHEEN, and, in my
>view, is too likely to be imitative to be included in any list.

***

Even asking that a word be found in all or many dialects is not a simple
criterion:

>   (2) It is found throughout the language, or nearly so.
>
>   [Since the better dictionaries assign words to the conventional
>   dialects, it is easy to formalize this requirement as we
>   see fit.]

There are other subsidiary criteria which can reinforce or undermine
the probability that a word is ancient Basque; these can be
combined with information about WHICH dialects it is attested in,
not just how many of them or which branches of the dialect family tree.

***

My point in all of the above is that using simple cutoffs is a kind of
rush to judgement, the opposite of the ability to delay judgement
which is the hallmark of many good research personalities.
The reason to avoid simple cutoffs is that they throw away potentially
important data.

But there is every reason to USE every one of the criteria
(those which Larry Trask expressed as cutoffs),
to use them as LABELS on the vocabulary items,
which in a computer database can be taken into consideration
whenever any question is asked of the database.
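
To make the idea of criteria-as-labels concrete, here is a minimal sketch
of such a database. The class `Entry`, the function `select`, and the
label names are hypothetical. The nursery and imitative labels follow the
examples in the quoted criteria (5) and (6); `word_x` is a placeholder,
not a real Basque word.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    form: str
    labels: frozenset = frozenset()  # criteria stored as labels, not as filters


# Labels for the attested forms follow the quoted examples;
# "word_x" is a hypothetical placeholder with no labels at all.
LEXICON = [
    Entry("ama", frozenset({"nursery"})),
    Entry("aita", frozenset({"nursery"})),
    Entry("miau", frozenset({"imitative"})),
    Entry("mu", frozenset({"imitative"})),
    Entry("word_x"),
]


def select(lexicon, exclude=()):
    """Return the entries carrying none of the labels in `exclude`.

    Nothing is ever deleted from the lexicon; each question asked of it
    decides afresh which labels to treat as disqualifying.
    """
    excluded = frozenset(exclude)
    return [entry for entry in lexicon if not (entry.labels & excluded)]
```

Calling `select(LEXICON)` returns everything, while
`select(LEXICON, exclude={"nursery"})` reproduces the effect of a cutoff
for one query only, leaving the data intact for the next.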

In most studies of canonical forms which attempt to purify
vocabulary lists, I would expect a statistical tendency akin to
regression towards the mean, that is, some reinforcement of universally
typologically dominant patterns, such as CV-CV(-CV) syllable structure.
It is where we find DEVIATIONS from such universal patterns
reinforced statistically by steps of "purifying" a vocabulary set
that we would have the most interesting characteristics, perhaps
attributable to an early or proto-language.
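
The comparison I have in mind could be sketched roughly as follows. This
is only a toy: the function names are hypothetical, every letter is
crudely treated as one segment (which ignores digraphs), and the sample
is just the handful of forms quoted above with the labels given there,
far too few for any real statistics. It shows the mechanics of measuring
how one "purifying" step shifts the share of strictly CV-CV-type forms.

```python
import re

VOWELS = set("aeiou")


def skeleton(form):
    """Map each letter to C or V; hyphens and other non-letters are dropped."""
    return "".join(
        "V" if ch in VOWELS else "C" for ch in form.lower() if ch.isalpha()
    )


def cv_share(forms):
    """Fraction of forms whose skeleton is one or more CV syllables."""
    forms = list(forms)
    if not forms:
        return 0.0
    hits = sum(1 for f in forms if re.fullmatch(r"(CV)+", skeleton(f)))
    return hits / len(forms)


# (form, labels) pairs; labels follow the quoted criteria (5) and (6).
SAMPLE = [
    ("ama", {"nursery"}), ("aita", {"nursery"}),
    ("miau", {"imitative"}), ("mu", {"imitative"}), ("be", {"imitative"}),
    ("din-dan", {"imitative"}), ("tu", {"imitative"}), ("usin", {"imitative"}),
]


def purification_effect(sample, excluded_label):
    """Return (share before, share after, change) for one purifying step."""
    before = cv_share(form for form, _ in sample)
    after = cv_share(
        form for form, labels in sample if excluded_label not in labels
    )
    return before, after, after - before
```

With a real labelled lexicon, the interesting cases would be those where
the change moves AWAY from the typologically dominant pattern as more
purifying steps are applied.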

Lloyd Anderson
Ecological Linguistics