Excluding data

ECOLING at aol.com
Thu Aug 19 14:22:57 UTC 1999

Larry Trask today defended his criteria for what to exclude
from an initial list of potential proto-Basque vocabulary.
Let me start by quoting his point that he is not excluding data:

>Second, nothing is thrown away.  All data remain available for later
>consideration, after an initial list is obtained.

Of course, but the "initial" list can bias things mightily.
Trask quite misunderstands my point about excessive reliance on
canonical forms (CVCV vs. VCVCV etc. etc.).
The criteria for selecting an "initial" list can bias us in many ways.
Excluding sound-symbolic words can artificially and circularly
lead us to expect a much greater conformance to hypothetical
canonical forms than is in fact the case in most languages.
Or the reverse, assuming a simpler set of canonical forms
can promote the exclusion of sound-symbolic words.

Here is Trask's discussion of this,
which I hope is evidently circular in form,
whether or not his /tipi/ ~ /tiki/ is in fact
descended from Proto-Basque.
That is, because this sound-expressive word does not conform
to the canonical form hypothesized for much of the less concrete Basque
vocabulary, therefore we have an extra argument in favor of
excluding it (though he does include this one).  But what if
there are several canonical forms, as in most real-world languages,
some forms occurring more often in sound-expressive vocabulary?

>Anyway, words like `teeny' are not so much imitative as expressive or
>sound-symbolic.  Basque has a huge number of such words, but I am
>deliberately choosing not to exclude them expressly from the initial
>list.  Why?  First, because the hopeful long-rangers seeking improbable
>relatives for Basque have frequently cited such words as comparanda and
>have complained bitterly that I am being circular when I reject these
>words as suitable comparanda because of their expressive formation.
>Therefore, *demonstrating* that these words do not look like native and
>ancient Basque words is precisely one of the goals I hope to achieve.

[Demonstrating that sound-symbolic words have a different canonical
form has little value in arguing that they were not proto-Basque,
precisely because such differences of canonical form do occur in many
real languages, so why not also in (proto-)Basque?]

>Second, because I am confident that these words will be ruled out by the
>criteria I have already listed -- notably by the requirement that a word
>should be found throughout the language.  With only a few exceptions,
>expressive and sound-symbolic formations in Basque are severely
>restricted in distribution, being confined in each case to a small area.

[Expressive and sound-symbolic words are also very much under-recorded
for many languages and language families.  Many of them are not known
to learned scholars who are not native users of the languages, because
they are used in language registers which are never the domain of activity
of those scholars.  So this criterion is PARTLY defective or circular.
There is a partly definitional relation between sound-symbolic and
narrowness of attestation in recordings.]

>One of the few exceptions is <tipi> ~ <tiki> 'small', which satisfies
>all six of my criteria and will have to go into the initial list.  But,
>in that list, it will stand out like the proverbial sore thumb, because
>of its utterly anomalous phonological form:

>   (a) it has an initial voiceless plosive;
>   (b) it has an initial coronal plosive;
>   (c) it has the form CVCV with plosives in both C positions
>       and a voiceless plosive in the second C position;
>   (d) it exhibits a unique regional variation in form.

>In my view, this will be more than enough to remove the word from the
>second version of the list, or at least to earn it a flag as an
>anomalous item.  Even though it exists everywhere, and even though it is
>recorded exceptionally early (about 1400), it looks as much like a
>native and ancient Basque lexical item as `pizza' looks like a native
>and ancient English word.

[pizza is not sound-symbolic, and uses odd (for English) spelling,
but never mind...]

To me, this is a clear indication not of something wrong with including
the word /tipi/ ~ /tiki/ as potentially proto-Basque,
but something inconsistent in the total set of criteria,
UNDER THE CONDITIONS that we are trying
to force a consistent canonical form onto our hypothesized "initial list"
of potential proto-Basque vocabulary.  That goal may be wrong-headed.

The criteria, and how they are used, are themselves JUST A SET
OF TOOLS, and those tools, and how they are in fact used,
need themselves to be evaluated to see whether they lead to errors.
The exclusion of a sound-symbolic word on the grounds that it
has a different canonical form from other vocabulary is a clear error,
given the factual knowledge that such sound-symbolic forms
do often (world-wide) have different limitations on their phonological form.

More generally, Trask introduces his message today with the following
response to my comment:

>> While early attestation is obviously a valuable cutoff (3),
>> if the aim is to achieve the absolutely purest vocabulary set possible,
>> it may run the error of excluding much authentic native Basque
>> vocabulary which simply happened not to be recorded "early".
>> Can we rather have several "degrees of earliness", or distinctions
>> of WHICH sources of attestation?  Trask does some of the latter,
>> suggesting to exclude some sources which he believes are
>> particularly unreliable.


>Of course, any cutoff criterion is likely to exclude a few genuinely
>native and ancient Basque words, but it will certainly also exclude a
>much larger number of words which are not native and ancient.

Agreed, usually so.

But I must flatly disagree with the following quote.
A primary object can legitimately be to attempt
to distinguish proto-Basque words from words now used in Basque
which do not descend from proto-Basque.
But that is not at all the same as this:

>And the primary object is
>to exclude the words which should be excluded, not to
>include every single word which should be included.

It is not one or the other, it is both.

Any tool for achieving this can make either sort of error,
errors of inclusion or errors of exclusion.  It is not simple.
The mistake here, in my view, is very much akin to the mistake
of rushing to judgement discussed in another message today,
when we are dealing with more complex situations of
provisional hypotheses.
The various criteria for what is a proto-Basque word
interact in complex and sometimes contradictory ways.
There is no point in hiding these difficulties to render a "final"
decision, even if it is claimed to be only an "initial" list.
Rather, simply add to our knowledge of the data set,
so anyone can at any time reconsider the interdependence
of any of our criteria (tools).

On distribution:

>> Even asking that a word be found in all or many dialects is not a simple
>> criterion:

>> There are other subsidiary criteria which can reinforce or undermine
>> the probabilities that a word was ancient Basque, which can be
>> combined with information about WHICH dialects it is attested in,
>> not just how many of them or which branches of the dialect family tree.

>Agreed, but.  There exist words which are shared between the eastern and
>western dialects of Basque but which are unattested in the central
>dialects.  By the peripherality criterion, these words are good
>candidates for ancient status: they look like words which were once
>universal in the language but which have been lost from central
>dialects.  But I still prefer to exclude such words from our initial
>list.  Once that initial list is set up, then these east-west words are
>the next group to look at, to see if they too have the phonological
>shapes of native and ancient words.  But I don't think they ought to be
>included at the first stage.  There are hundreds of words found in all
>dialects: let's look at those first.

I think Trask's procedure here is rather unlike that of most comparativists,
in that they usually LABEL which languages or dialects a form occurs in,
and if it occurs in certain combinations of descendants, it is treated
as probably descended from the proto-language.  [For Basque, there
might be a special case, if one hypothesized borrowings from
RELATED (Romance) languages separately into different peripheral
Basque dialects.]

>> But there is every reason to USE every one of the criteria
>> (those which Larry Trask expressed as cutoffs),
>> to use them as LABELS on the vocabulary items,
>> which in a computer database can be taken into consideration
>> whenever any question is asked of the database.

>Sure, but that's a different exercise.

Not a different exercise at all, the same one, simply not
throwing away data or making it hard to access or reconsider.

The ability to have all of the data available in a database,
tagged and labeled according to all of the criteria Trask mentions,
differs from Trask's procedure primarily in not having to render
final judgements prematurely, because we can always go back and
re-weight the criteria, look at them again from a different perspective.
Without such databases, decisions are made once and cannot be
easily reconsidered later with the benefit of full information.
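
To make the contrast concrete, here is a minimal sketch in Python of what
I mean by labels in a database. It is not anyone's actual database: the
field names, the function select(), the cutoff year 1600 and the two
placeholder entries (word_a, word_b) are all my own assumptions for
illustration. Only the entry for tipi ~ tiki uses facts stated in this
thread (found everywhere, recorded about 1400, expressive).

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical three-way dialect grouping, assumed only for this sketch.
ALL_DIALECTS = frozenset({"western", "central", "eastern"})

@dataclass
class Entry:
    form: str
    gloss: str
    dialects: frozenset            # dialect groups where the word is attested
    first_attested: Optional[int]  # approximate year, None if unknown
    tags: set = field(default_factory=set)  # free labels, e.g. "expressive"

def select(entries, max_year=None, required_dialects=frozenset(),
           exclude_tags=frozenset()):
    """Return the entries that pass the given cutoffs.

    The criteria are parameters of the query, not properties of the data,
    so nothing is ever deleted from `entries`.
    """
    chosen = []
    for e in entries:
        if max_year is not None and (
            e.first_attested is None or e.first_attested > max_year
        ):
            continue  # fails the early-attestation cutoff
        if not required_dialects <= e.dialects:
            continue  # not attested in every required dialect group
        if e.tags & exclude_tags:
            continue  # carries a label we chose to exclude this time
        chosen.append(e)
    return chosen

lexicon = [
    Entry("tipi ~ tiki", "small", ALL_DIALECTS, 1400, {"expressive"}),
    # Placeholders, not real Basque words:
    Entry("word_a", "(placeholder)", frozenset({"western", "eastern"}), 1600),
    Entry("word_b", "(placeholder)", ALL_DIALECTS, 1900),
]

# A strict first pass: early attestation and presence in all dialects.
strict = select(lexicon, max_year=1600, required_dialects=ALL_DIALECTS)

# The same data, reconsidered: accept east-west words as well.
east_west = select(lexicon, max_year=1600,
                   required_dialects=frozenset({"western", "eastern"}))

# A different question: leave out words labeled expressive.
plain = select(lexicon, exclude_tags={"expressive"})
```

The point of the design is that each "list" is just the answer to one
query. Changing a cutoff, or dropping one, means asking another query of
the same tagged data, not rebuilding a list from scratch.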

I think Trask misunderstood my reference to deviations from
typologically common canonical form.  In fact, it is precisely the
extra frequency of Vowel-initial words which Trask notes for
pre-Basque vocabulary which is significant (despite the existence
of some other languages which have this also), whereas a predominance
of CVCV- forms probably would not be, since it is world-wide
more common.

So, once again, let's keep the data available,
analyze it with all the care Trask obviously can muster,
but not hide the structure of the analysis,
nor make it needlessly difficult to go back and reconsider particular
decisions or whole groups of decisions at a later time.

Lloyd Anderson
Ecological Linguistics
