9 specifics on Including and excluding data

Thu Oct 28 16:50:50 UTC 1999

This is a consolidation of a list of nine (9) specific suggestions
to modify Trask's criteria for data to be included as
potential candidates for early Basque.
It was originally sent on October 20th; this is a revision of October 28th.

Seven of these specific proposals were stated previously in messages
available to Trask before he wrote the message to which this is a reply,
in which he said that specific proposals had not been received.
Two of them were sent to the list before receiving Trask's message
of October 20th, but Trask would not have seen them yet.

***

>Lloyd and others have repeatedly implied that there is, or might
>be, something wrong with my criteria.  I have therefore asked for alternative
>criteria.  I have seen none, except for Lloyd's suggestion that 1700 is
>a better cutoff date than 1600 ... [and one on expressives, see below]

I have repeatedly expressed my suggestions for improving criteria.
That *includes* dropping some.

This message is not a mere repeat listing of what has been posted
previously.
To make it more useful, I have restated some crucial parts
which Trask missed in referring to them, as well as adding further
*explanations* and *examples*, which most readers will see
as merely details implied by what was already stated.

***

Number one.
Counteracting biases of documentation by subject matter.
Previously stated, as Trask now agrees,
though his restatement makes it appear rather trivial,
losing its principled basis and therefore greatly reducing its reach.

Number one is not merely the 1700 rather than 1600 cutoff date,
but was based on a more principled suggestion
that we should avoid biasing by the sheer accident of the limited
nature of available documentary evidence for particular time periods.
In attempting to find the oldest native Basque vocabulary,
there will be semantic domains which are essentially excluded
by such sheer accidents, and for these we can take the earliest
documentary evidence available which covers those semantic
domains, not quite "whatever the date", but with considerable
leeway in accepting dates later than 1700 if necessary to get
documentation for a particular subject matter.
The point was NOT the date (1700 vs. 1600),
the point was to avoid the accidents of exclusion.
Its implications are both much broader and much more specific.

***

Number two.
Breadth of attestation required made proportionate to
breadth of documentation by subject matter.
Previously stated.
Not noted by Trask in the message to which I am replying.

     I also proposed a still more refined approach in which the
number of dialects we wish to have represented would vary
precisely in order to counteract the accidents of preservation of
documents in particular subject matters in only some dialects.
If for example documents referring extensively to colors
were only attested in three dialects, then attestation in only two
dialects might count as sufficient to satisfy adequately
the criterion of breadth of attestation.

***

Number three.
Breadth of attestation.
Previously stated.
Not noted by Trask in the message to which I am replying.

     I suggested very early that attestation in all dialects was not required.
Some intermediate would be appropriate, though I did not give
a particular number.
Even without a particular number, this is still a specific suggestion.
Can it be made still more specific?  Of course.  Almost anything can be.
In the example just above,
I took two out of three dialects as sufficient.
Three out of five would also be a reasonable criterion
(not as a cutoff, but as a sufficient *minimum* on a criterion
of measured degree of breadth of distribution).
If only two dialects are available (for the relevant subject matter),
I would personally take one as sufficient for a *minimum*.
Remember that by suggestion number seven,
all of this information is kept, by tagging on the lexical item,
so we can still distinguish cases later if we wish.
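
The three minimums mentioned above (two of three, three of five, one of two)
all happen to agree with a simple rule: half of the dialects which document
the relevant subject matter, rounded up.  The sketch below assumes that rule.
It is only one reading of the examples, not a formula stated in this message,
and the function names are hypothetical.

```python
import math


def minimum_dialects(documented_dialects):
    """Assumed rule: half of the dialects documenting the subject matter, rounded up."""
    if documented_dialects < 1:
        raise ValueError("at least one dialect must document the subject matter")
    return math.ceil(documented_dialects / 2)


def meets_minimum(attested_in, documented_dialects):
    """True if an item reaches the (assumed) minimum breadth of attestation."""
    return attested_in >= minimum_dialects(documented_dialects)


# The three cases mentioned in the text.
for documented in (3, 5, 2):
    print(documented, "documented ->", minimum_dialects(documented), "sufficient")
```

This gives only a *minimum*; the actual counts would still be kept as tags
on each item, so that stricter thresholds can be tried later.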

***

Number four.
Morphemic composites as evidence for their parts.
This one is a recent refinement, in response to the example
of <uko> 'forearm' included in <ukondo> 'elbow'.

The mainstream would I think have included <uko>
on the basis of <ukondo> almost without question,
because the parts of the (compound?) are transparent,
and therefore the root from which it is formed must be
at least as ancient or more ancient than the compound.
I would not have dreamed it was necessary to state explicitly
that morphologically complex items can give evidence
for the earlier use of their morphemic parts,
since I assume linguists generally take it for granted
(except in a few special cases like back-formations).

In a case in which there is strong support from
inclusion of a root in a compound or derivative
in another dialect, it can even be possible to include
a form attested (as bare root or stem) only in one dialect.
IF (note IF) we were using the criterion of three dialects
out of five, then we would merely need <uko> in one
dialect and <ukondo> in two other dialects to reach the
criterion of a minimum of three dialects for the root <uko>,
though of course that would be only two dialects for the
compound <ukondo> so the composite form itself
would not exceed *this* minimum if it were
attested in only two.
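
The counting in the paragraph above can be sketched as a set union.
The dialect labels "A", "B", "C" are placeholders invented for illustration
(the actual distribution of <uko> and <ukondo> is not given here), and the
function name is hypothetical.

```python
def root_support(bare_root_dialects, compound_dialects):
    """Dialects supporting a root: attested bare, or inside a transparent compound."""
    return set(bare_root_dialects) | set(compound_dialects)


# Placeholder labels, not real dialect data.
uko_dialects = {"A"}           # bare root attested in one dialect
ukondo_dialects = {"B", "C"}   # compound attested in two other dialects

support_for_root = root_support(uko_dialects, ukondo_dialects)
print(len(support_for_root))   # dialects counted for the root <uko>
print(len(ukondo_dialects))    # dialects counted for the compound <ukondo>
```

Under the hypothetical three-out-of-five criterion, the root would reach the
minimum while the compound, counted on its own two dialects, would not.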

Here is the information from Trask:

>For example, <uko> 'forearm', is nowhere attested before the 17th-century
>writer Oihenart, but then it appears to be attested *only* in Oihenart,
>so it will fail to be included anyway.

>But this example raises another interesting point.
>Though <uko> itself is not found outside of Oihenart,
>its transparent compound <ukondo> (and variants)
>'elbow' (from <uko> plus <ondo> 'bottom') is close to universal in the
>language, and recorded from 1596.
>Now <ukondo> must be excluded as obviously
>polymorphemic, but I will have to decide whether its existence should or
>should not license the listing of <uko>,
>which itself does not meet my criteria.
> At the moment, I have not yet decided, though I lean toward the negative.

The exclusion of multimorphemic items is a very strong bias against
the result being a representative cross-section,
even of the *roots* of a normal language
(for those normal languages which do have multimorphemic items).
While the *end goal* may be a list of morphemes or even root morphemes,
the data used to obtain these should of course include multi-morphemic
items.  To do otherwise is an arbitrary, unjustified bias against the
normality of languages which do contain multimorphemic words,
and some morphemes including some roots occur only in such words.

***

Number five.
Balanced use of criteria, each alone not decisive.

This one has been made explicit only recently, as soon as I became
consciously aware of how near Trask comes to saying that each
criterion must be satisfied independently of the others,
of what he perhaps means by "best" examples, rather than merely
very good candidates for early Basque.
     Numbers two and four are examples of the
INTERACTION of criteria, that no criterion by itself should be
determining of inclusion or exclusion.  I took this for granted,
but now make it explicit.  Combine the "scores" from several
criteria, make a balanced decision.  That is specific, and can be
made more so.  It is fairly common practice in comparative linguistics
to have combined lists, those proposed cognates which seem perfect
both on sound correspondences and on semantics, those which
are perfect on sound correspondences but slightly odd on semantics,
and so on, with greater detail and elaboration.  No reason not to
do that here also.
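
A minimal sketch of combining "scores" follows.  The criterion names, the
equal weights, and the cutoffs for the three grades are all assumptions
chosen for illustration; nothing in this message fixes them.

```python
# Hypothetical criteria and weights; each per-criterion score runs from 0 to 1.
WEIGHTS = {
    "early_attestation": 1.0,
    "breadth_of_attestation": 1.0,
    "not_shown_to_be_borrowed": 1.0,
}


def combined_score(scores):
    """Weighted average over all criteria; a missing criterion counts as 0."""
    total_weight = sum(WEIGHTS.values())
    weighted = sum(WEIGHTS[name] * scores.get(name, 0.0) for name in WEIGHTS)
    return weighted / total_weight


def grade(scores):
    """Sort an item into a graded list instead of a yes/no decision."""
    value = combined_score(scores)
    if value >= 0.9:
        return "strong"
    if value >= 0.6:
        return "good"
    return "weak"
```

With these assumed numbers, an item scoring zero on one criterion and full
marks on the other two is graded "good" rather than excluded outright,
which is the point of not letting any single criterion decide.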

***

Number six.
Avoiding biases against expressives.
Previously stated, as Trask agrees,
though he very much misrepresents the content of this one.

>I have seen none, except for Lloyd's suggestion  ...[one above, and]
>and his insistence that sound-symbolic words
>should be self-consciously added to the list according to no specified
>criteria.

This is most emphatically NOT what I suggested.
I was explicit that I suggested dropping or modifying criteria which
had the *effect* of biasing selection against any category of words,
and that I happened to be qualified to talk about why a bias against
sound-symbolic words might distort any conclusions about
canonical forms.
That is quite another matter from self-consciously insisting
on adding expressives.

>I have agreed that the first is possible,

[but Trask argued against the important part of it,
suggesting that a list of ranges of vocabulary subject matter is
not native; this is a topic which might be explored in greater depth
by those who know the field.  I disclaim competence here
except to evaluate the logic of proposals]

>but dismissed the second
>as lacking in specifics and intrinsically circular.

As Trask restated it, I would agree that self-consciously adding
expressives to the list would be unprincipled,
if that were done merely for the purpose of adding expressives.
But as reiterated above, that was most emphatically NOT what
I proposed.  I proposed rather eliminating artificial barriers to
their inclusion, through accidents of more limited attestation
and the interaction of supposed criteria for number of dialects
required in attestations.

If expressives are attested only in one dialect,
then only one dialect would be sufficient as a bare minimum
satisfaction on that criterion of distribution.
(an instance of suggestion number two above,
not at all specific to expressives).
In fact, I gather from some other remarks
by Trask quite recently, that there are numerous alternative
words for "butterfly".  If we had a full set of these displayed
for us, who knows what we might learn about whether
any particular forms should be considered inherited from
early Basque?  And about our own thinking about criteria
for inclusion and exclusion.  Good examples have a way of
revealing paradoxes of thinking, or otherwise sharpening
our thinking.

***

Number seven.
Tagging of items, rather than inclusion and exclusion.
Previously stated.
Not noted by Trask in the message to which I am replying.

In redefining where on the continuum to draw the line for
"best" examples (since to be meaningful we must recognize
that is what anyone does by choosing or adjusting their criteria),
we can gain the benefits of more information and lose nothing.
Any information that someone might have used in a criterion
dictating exclusion can be included in a computer database
as a tagging of the individual items.  Additional information
can also be added as tagging.  The benefits of being able to
consider alternative hypotheses so quickly and easily were
discussed, and the fact that some questions will simply not be
asked if it is too difficult to ask them.
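
As a rough sketch of what tagging might look like, the block below keeps
both items from the <uko> / <ukondo> example and records as tags the facts
quoted from Trask (17th-century versus 1596 first attestation,
polymorphemic or not).  The record layout, tag names, and the two sample
queries are assumptions made for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class Item:
    form: str
    gloss: str
    tags: dict = field(default_factory=dict)


# Nothing is excluded at this stage; the would-be exclusion criteria become tags.
items = [
    Item("uko", "forearm", {"first_century": 17, "polymorphemic": False}),
    Item("ukondo", "elbow", {"first_century": 16, "polymorphemic": True}),
]


def select(records, predicate):
    """Return the forms whose tags satisfy a given hypothesis (predicate)."""
    return [r.form for r in records if predicate(r.tags)]


# Two alternative hypotheses, each a one-line query over the same data.
strict = select(items, lambda t: t["first_century"] <= 16 and not t["polymorphemic"])
relaxed = select(items, lambda t: t["first_century"] <= 17)
```

Changing a hypothesis then means changing one line of a query, not
rebuilding the list.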

***

Number eight.
Slight global preference to include basic vocabulary,
     unless provably borrowed.
Previously stated.
Not noted by Trask in the message to which I am replying.

The use of the Swadesh list or other list of *relatively* more
basic vocabulary could be used to give an extra point or fraction
of a point to items of basic vocabulary, perhaps causing some
of them to be included which otherwise would not rate highly
enough on the balanced combination of other criteria.
The principled basis for this is that languages do have basic
vocabulary, that basic vocabulary is, statistically speaking only,
relatively more resistant to replacement by loanwords,
and that the positing
of a set of vocabulary for an early form of a language should
probably include lexical items for most such basic vocabulary.

This would not overrule clear cases of *known* borrowings;
in such cases we might indeed appropriately have a "trump"
criterion for exclusion, but it should be used as a "trump"
only when *known* is meant very strictly, not mere speculation.
Trask's example of "mountain" is probably such a case,
to be excluded as an obvious loan.

But that does not contradict using this criterion
to evaluate whether we may have excluded too much, overall.
In effect, this suggestion shifts the burden of proof slightly,
so that to exclude an item of basic vocabulary we need stronger
evidence than we would for non-basic items.
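
The shifted burden of proof can be sketched as a small bonus plus a "trump".
The size of the bonus (half a point) and the function name are assumptions
for illustration; the text says only "an extra point or fraction of a point".

```python
BASIC_VOCABULARY_BONUS = 0.5  # assumed value, not specified in the text


def adjusted_score(base_score, is_basic, known_borrowing):
    """Add a bonus for basic vocabulary; a strictly *known* borrowing trumps everything."""
    if known_borrowing:
        return None  # excluded outright, whatever the other scores
    bonus = BASIC_VOCABULARY_BONUS if is_basic else 0.0
    return base_score + bonus
```

The `known_borrowing` flag should be set only when the borrowing is known in
the strict sense, not on mere speculation.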

What exact proportion of a Swadesh list
might we want to be sure is included?
I do not presume to know,
and there certainly are differences among languages in the
proportion of basic vocabulary which is native.  But even if
not precisely quantified, this criterion is specific and has a
principled basis.  That basis relies on the idea that we are
evaluating our criteria for their appropriateness, just as we
are using them to evaluate items for inclusion or exclusion
as within the bounds of "best" candidates for early Basque.

***

Number nine.
Avoiding cascading errors,
     not insulating steps in the reasoning.
Previously stated.
Not noted by Trask in the message to which I am replying.

It is important to avoid circularity, by not artificially insulating
steps in the reasoning process, by not allowing selection of data
to be dictated by the hypotheses one has, more than absolutely
necessary.  This was stated first in regard to canonical forms,
because of the likelihood that the initial selection under Trask's
criteria would bias against expressives which (Trask indicated)
do indeed have some different canonical forms from other vocabulary.
In other words, we should avoid excluding these from the
beginning, so that the initial results will include a full range of
native canonical forms, and will not bias later work circularly to
incorrectly exclude items on the basis of a narrow set of formulas
for canonical forms, merely because almost no examples of such
canonical forms typical of expressives happened to be included
at stage one.

***

Sincerely,
Lloyd Anderson