Basque statistics

Fri Aug 13 15:14:38 UTC 1999

On Thu, 12 Aug 1999 ECOLING at aol.com wrote:

> While early attestation is obviously a valuable cutoff (3),
> if the aim is to achieve the absolutely purest vocabulary set possible,
> it may run the risk of excluding much authentic native Basque
> vocabulary which simply happened not to be recorded "early".
> Can we rather have several "degrees of earliness", or distinctions
> of WHICH sources of attestation?  Trask does some of the latter,
> suggesting to exclude some sources which he believes are
> particularly unreliable.

Of course, any cutoff criterion is likely to exclude a few genuinely
native and ancient Basque words, but it will certainly also exclude a
much larger number of words which are not native and ancient.  And the
primary object is to exclude the words which should be excluded, not to
include every single word which should be included.

The objective is to construct a list of those words which have the
strongest claim to being native and ancient.  It is therefore far more
important to exclude every word which does not have a good claim to
native and ancient status than it is to sweep up every word which
*might* be native and ancient.  First things first.

> The following (4) is also perfectly reasonable,
> if intended merely to achieve the "purest" possible vocabulary set,
> but it also can exclude legitimate native ancient Basque vocabulary,
> especially if, for example, some of that vocabulary has been borrowed
> FROM Basque INTO another language or languages.

[LT]

>>   (4) There is no reason to believe that it is shared with
>>   languages known to have been in contact with Basque.

Well, we know for certain that Basque has taken thousands of words from
Latin and Romance, while loans in the other direction are very rare and
almost entirely confined to those local Romance varieties in direct
contact with Basque.  In fact, Basque <ezker> `left (hand)', which has
been widely borrowed into Ibero-Romance, is perhaps the *only* Basque
word borrowed widely into Romance, apart from those borrowed in the last
few years.

That being so, it seems clear that words shared between Basque and its
neighbors should be systematically excluded from our list, because, for
any given word of this type, the probability is overwhelming that Basque
has borrowed it.

If Basque loans into Romance were numerous, then the presence or absence
of shared vocabulary would probably have to be rejected altogether as a
criterion.  However, such is not the case.

> Nursery words may be among the most persistent in many cultures, so
> I do not see a reason to flatly exclude these.  Label them as
> nursery words, perhaps (though the category is actually much broader
> than the two examples)...

I'm afraid I can't agree.  Nursery words are routinely excluded from the
initial stages of any comparison because they are so treacherous: they
are frequently invented independently in different languages.

*Once* a genetic link has been established, it is legitimate to see if
any nursery words can be reconstructed for the ancestral language.  But
you can't use words like <ama> `mother' as evidence for a link in the
first place.

> And even sound-imitative words may sometimes be of use,
> they can undergo regular sound changes (as for laughter
> "ha-ha" becoming Russian "xoxotat' " with a>o shift).
> Or more borderline, "teeny" becoming "tiny" in the English
> Great Vowel Shift, regenerated (or borrowed from other dialects?)
> as "teeny" again.  So again, not absolutely excluding them,
> but studying what difference it makes to patterning if they
> are included vs. excluded.

Much the same comment as with nursery words.

Anyway, words like `teeny' are not so much imitative as expressive or
sound-symbolic.  Basque has a huge number of such words, but I am
deliberately choosing not to exclude them expressly from the initial
list.  Why?  First, because the hopeful long-rangers seeking improbable
relatives for Basque have frequently cited such words as comparanda and
have complained bitterly that I am being circular when I reject these
words as suitable comparanda because of their expressive formation.
Therefore, *demonstrating* that these words do not look like native and
ancient Basque words is precisely one of the goals I hope to achieve.
Second, because I am confident that these words will be ruled out by the
criteria I have already listed -- notably by the requirement that a word
should be found throughout the language.  With only a few exceptions,
expressive and sound-symbolic formations in Basque are severely
restricted in distribution, being confined in each case to a small area.

One of the few exceptions is <tipi> ~ <tiki> `small', which satisfies
all six of my criteria and will have to go into the initial list.  But,
in that list, it will stand out like the proverbial sore thumb, because
of its utterly anomalous phonological form:

	(a) it has an initial voiceless plosive;
	(b) it has an initial coronal plosive;
	(c) it has the form CVCV with plosives in both C positions
		and a voiceless plosive in the second C position;
	(d) it exhibits a unique regional variation in form.

In my view, this will be more than enough to remove the word from the
second version of the list, or at least to earn it a flag as an
anomalous item.  Even though it exists everywhere, and even though it is
recorded exceptionally early (about 1400), it looks as much like a
native and ancient Basque lexical item as `pizza' looks like a native
and ancient English word.
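
As a rough illustration only, the three form-based observations (a)-(c)
could be checked mechanically along the following lines.  This is a
sketch, not a procedure stated in this posting: the function name
anomaly_flags and the letter sets are hypothetical, and it assumes that
plain orthography can stand in for the phonological segments.  Point (d),
regional variation, cannot be read off a single form and is left out.

```python
# Assumption: one letter = one segment; digraphs are not handled.
VOWELS = set("aeiou")
VOICELESS_PLOSIVES = set("ptk")
VOICED_PLOSIVES = set("bdg")
PLOSIVES = VOICELESS_PLOSIVES | VOICED_PLOSIVES
CORONAL_PLOSIVES = set("td")


def anomaly_flags(word):
    """Return the form-based flags (a)-(c) that apply to `word`."""
    w = word.lower()
    flags = []
    # (a) initial voiceless plosive
    if w[:1] in VOICELESS_PLOSIVES:
        flags.append("initial voiceless plosive")
    # (b) initial coronal plosive
    if w[:1] in CORONAL_PLOSIVES:
        flags.append("initial coronal plosive")
    # (c) CVCV with plosives in both C slots, the second one voiceless
    is_cvcv = (
        len(w) == 4
        and w[0] not in VOWELS and w[1] in VOWELS
        and w[2] not in VOWELS and w[3] in VOWELS
    )
    if is_cvcv and w[0] in PLOSIVES and w[2] in VOICELESS_PLOSIVES:
        flags.append("CVCV, plosives in both C positions, second voiceless")
    return flags


# The words below are the ones cited in this posting.
for form in ("tipi", "tiki", "ezker", "ama"):
    print(form, anomaly_flags(form))
```

A word collecting several flags at once would be a candidate for the
"anomalous item" mark rather than for automatic removal.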

> Even asking that a word be found in all or many dialects is not a simple
> criterion:

> There are other subsidiary criteria which can reinforce or undermine
> the probabilities that a word was ancient Basque, which can be
> combined with information about WHICH dialects it is attested in,
> not just how many of them or which branches of the dialect family tree.

Agreed, but with one reservation.  There exist words which are shared
between the eastern and western dialects of Basque but which are
unattested in the central dialects.  By the peripherality criterion,
these words are good
candidates for ancient status: they look like words which were once
universal in the language but which have been lost from central
dialects.  But I still prefer to exclude such words from our initial
list.  Once that initial list is set up, then these east-west words are
the next group to look at, to see if they too have the phonological
shapes of native and ancient words.  But I don't think they ought to be
included at the first stage.  There are hundreds of words found in all
dialects: let's look at those first.
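
The ordering just described can be pictured as a simple staging rule.
The sketch below is illustrative only: the three region labels
("western", "central", "eastern") are a hypothetical coarse grouping of
the dialects, and the function name distribution_stage is invented for
the example.

```python
def distribution_stage(regions):
    """Assign a word to a stage from the regions where it is attested.

    Stage 1: attested everywhere (goes into the initial list).
    Stage 2: attested in east and west but not the centre
             (the next group to examine).
    Stage 3: anything more restricted (deferred further).
    """
    regions = set(regions)
    if {"western", "central", "eastern"} <= regions:
        return 1
    if {"western", "eastern"} <= regions:
        return 2
    return 3


print(distribution_stage(["western", "central", "eastern"]))  # stage 1
print(distribution_stage(["western", "eastern"]))             # stage 2
print(distribution_stage(["central"]))                        # stage 3
```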

> My point in all of the above is that using simple cutoffs is a kind of
> rush-to-judgement, the opposite of the ability to delay judgements
> which is the hallmark of many good research personalities.
> The reason to avoid simple cutoffs is because it throws away potentially
> important data.

Again I can't agree.

First, cutoffs are not a rush to judgement: they are merely common
prudence, a desire to advance as slowly and as carefully as possible.

Second, nothing is thrown away.  All data remain available for later
consideration, after an initial list is obtained.

As I stressed above, the first goal is to exclude questionable words,
not to avoid excluding genuine ones.
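
To make concrete the point that a cutoff discards nothing, here is a
minimal sketch in which the cutoffs merely partition the data.  The
label names, the Entry class and the function split_initial are all
hypothetical, loosely modelled on the criteria discussed in this thread;
the two sample words are taken from the discussion above.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    word: str
    labels: set = field(default_factory=set)  # hypothetical label names


# Hypothetical labels that keep a word out of the initial list.
EXCLUDING = {
    "late_attestation",
    "shared_with_neighbor",
    "nursery",
    "imitative",
    "not_in_all_dialects",
}


def split_initial(entries):
    """Partition entries into (initial, deferred); no entry is dropped."""
    initial = [e for e in entries if not (e.labels & EXCLUDING)]
    deferred = [e for e in entries if e.labels & EXCLUDING]
    return initial, deferred


entries = [Entry("tipi"), Entry("ama", {"nursery"})]
initial, deferred = split_initial(entries)
print([e.word for e in initial])   # words admitted at the first stage
print([e.word for e in deferred])  # words held back for later study
```

Every entry lands in exactly one of the two lists, so the deferred
words remain available once the initial list has been examined.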

> But there is every reason to USE every one of the criteria
> (those which Larry Trask expressed as cutoffs),
> to use them as LABELS on the vocabulary items,
> which in a computer database can be taken into consideration
> whenever any question is asked of the database.

Sure, but that's a different exercise.

> In most studies of canonical forms, ones which do attempt to purify
> vocabulary lists, I would expect there is a statistical tendency
> known as regression towards the mean, that is, some reinforcement of
> universally typologically dominant patterns, such as CV-CV(-CV)
> syllable structure. It would be where we find that deviations from
> such universal patterns are reinforced statistically by steps of
> "purifying" a vocabulary set that we would have the most interesting
> characteristics perhaps attributable to an early or proto-language.

Well, I am unwilling to assume in advance that CV syllable structures
must have been typical of Pre-Basque.  In fact, my preliminary work
suggests strongly that Pre-Basque had an enormous proportion of
vowel-initial words, probably totaling at least 50% of the recoverable
lexicon, and possibly more.  This I consider unusual, though a query
last year on the LINGUIST List turned up a few other languages with the
same property.

Romance languages generally have a much lower proportion of
vowel-initial words -- for example, a quick trawl of my biggest Spanish
dictionary suggests that about 25% of Spanish words are vowel-initial.
So, if we assume that we should automatically be preferring C-initials,
we are likely to start preferring Romance words to native Basque words.
Advance assumptions about what we `ought' to find are dangerous.
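
For anyone wanting to repeat this kind of count on a word list of their
own, a minimal sketch follows.  The function name vowel_initial_share is
hypothetical, and the check is purely orthographic, which is an
assumption: for Spanish, for instance, a silent initial <h> would have
to be handled separately.  The four sample words are simply those cited
in this posting, not a representative sample of anything.

```python
def vowel_initial_share(words, vowels="aeiou"):
    """Return the fraction of non-empty words whose first letter is a vowel."""
    words = [w for w in words if w]
    if not words:
        raise ValueError("empty word list")
    hits = sum(1 for w in words if w[0].lower() in vowels)
    return hits / len(words)


sample = ["ezker", "ama", "tipi", "tiki"]  # toy list, for illustration only
print(vowel_initial_share(sample))
```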

Larry Trask
COGS
University of Sussex
Brighton BN1 9QH
UK

larryt at cogs.susx.ac.uk