Sample bias, word length, frequency, circularity

Sun Dec 12 02:28:30 UTC 1999

This message concerns a distinct source of sample bias,
one not raised previously in these discussions,
and also one explanation of
what was meant earlier by "circularity".

***

It is well known that more common lexical items
are, as a statistical matter, shorter,
and rarer lexical items are, as a statistical matter, longer.
This is true world-wide; I assume the point is not debated.

If this is true, then a selection of vocabulary which is very
strongly biased towards the most common lexical items
in a language will also be strongly biased towards the shorter items.
A sample selected with a very strong bias towards shorter
words will be unrepresentative of the language it is drawn from,
especially if one is aiming at generalizations about canonical forms.
(It will also be skewed towards more grammatical lexical items,
"when, the, very, some, that, do".)
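The argument above can be illustrated with a toy simulation.
This is only a sketch under stated assumptions, not a model of Basque
or of any real language: the Zipf-like frequencies, the rule tying
length to frequency rank, and the names make_lexicon, mean_syllables
and share_long are all invented here for illustration.

```python
import random


def make_lexicon(n_items=2000, seed=0):
    """Build a toy lexicon in which rank 1 is the most frequent item.

    Assumptions (hypothetical, for illustration only):
    - relative frequency falls off as 1/rank (Zipf-like);
    - length in syllables grows slowly with rank, plus some noise.
    """
    rng = random.Random(seed)
    lexicon = []
    for rank in range(1, n_items + 1):
        frequency = 1.0 / rank
        # Base length: 1 for ranks 1-9, 2 for 10-99, 3 for 100-999, ...
        # Noise: one extra syllable about a third of the time.
        syllables = len(str(rank)) + rng.choice([0, 0, 1])
        lexicon.append(
            {"rank": rank, "frequency": frequency, "syllables": syllables}
        )
    return lexicon


def mean_syllables(items):
    """Average length, in syllables, of the given items."""
    return sum(item["syllables"] for item in items) / len(items)


def share_long(items, min_syllables=4):
    """Fraction of items with at least min_syllables syllables."""
    long_items = [i for i in items if i["syllables"] >= min_syllables]
    return len(long_items) / len(items)


lexicon = make_lexicon()
top_sample = lexicon[:100]  # a sample restricted to the most frequent items

print("mean syllables, whole lexicon:", round(mean_syllables(lexicon), 2))
print("mean syllables, top 100 only: ", round(mean_syllables(top_sample), 2))
print("share of long items, whole:   ", round(share_long(lexicon), 3))
print("share of long items, top 100: ", round(share_long(top_sample), 3))
```

Under these assumptions the frequency-restricted sample is shorter
on average and contains almost none of the longer items, although
those items make up a large part of the toy lexicon. How strong the
effect is in a real language depends on the real relation between
frequency and length, which this sketch does not estimate.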

This matter of a bias in word length
is another factor which I did not previously mention explicitly,
which argues against having too small a list of items included
in a sample, if one wishes to draw conclusions of general validity,
when one is trying to determine the canonical forms
even for monomorphemic lexical items in early Basque
(or any other language),
and even if (as Trask very nearly specifies today),
one is not interested initially in canonical forms of expressives.

Multisyllabic forms which actually occurred
in spoken early Basque are of course monomorphemic
if they are not analyzable into components within Basque.
Whether or not they will be picked up by a particular
sampling technique -- that is one of our questions.

So the discussion in another message I have sent today,
under the title "Excludes much", is highly relevant to the
question of possible distortion of a sample of vocabulary.

That other message contrasts, roughly, the basic vocabulary
of ordinary life on land and sea with
the kind of basic vocabulary which is of highest frequency,
independent of subject matter of texts,
which will therefore be most likely to be found
in at least four out of five dialect groups,
following Trask's criteria for inclusion.
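To see how strongly a "four out of five" requirement can favor
high-frequency items, here is a small calculation. It is a sketch
under assumptions which are mine and not Trask's: each word is
treated as having the same chance p of being attested in any one
dialect group, independently of the other groups, and the function
name prob_at_least is hypothetical.

```python
from math import comb


def prob_at_least(k, n, p):
    """Probability of at least k successes in n independent trials,
    each succeeding with probability p (binomial tail)."""
    return sum(
        comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k, n + 1)
    )


# Chance of meeting a "4 of 5 dialect groups" test for several
# assumed per-group attestation probabilities.
for p in (0.9, 0.7, 0.5, 0.3):
    print(p, round(prob_at_least(4, 5, p), 4))
```

On these assumptions the chance of passing falls off much faster
than p itself, so items which are somewhat less often attested
(which will tend to be the rarer, and statistically longer, items)
are disproportionately unlikely to enter the sample.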

Lloyd Anderson

***

On other matters,
I think we have reached some partial closure on one point.

Trask has written:

>But who ever said I was interested in canonical forms
>"for the language as a whole"?

I believe the original statements did not specify exclusions,
and aimed at general validity, so one could reasonably
assume they were intended for the language as a whole.
The exclusion of expressives,
*or systematically of any other group of words*
(such as the longer words, as noted above),
through any aspect of the sampling procedure,
would of course tend to undermine such general validity.

***

Trask has affirmed:

>Expressive formations may be constructed
>according to different rules from ordinary words.

>Indeed they are, but how can I hope to establish
>this objectively unless I
>*first* identify the rules for ordinary words?

The methodological point, I think, would be that
one cannot separate out ordinary words from
expressives (or from any other category of words)
in one's selection of data, without knowing how to
recognize what expressives are.

But as I have previously stated, as long as the
expressives are not excluded as "not candidates
for early Basque", they will be considered at some point.

It is the phrasing which Trask used early in these discussions
which seems to place his criteria above being questioned.
That phrasing seemed to aim at conclusions valid for "early Basque",
and linked these with his criteria,
rather than saying that his conclusions would be
valid for "a subset of early Basque",
namely that subset which is selected by his criteria.

Perhaps this way of stating it makes it clearer
what has been meant by "circularity".

If we say that
conclusions from analyzing
a particular subset of early Basque will
be valid for the subset of early Basque
selected by the criteria used to select it,
we of course have a plausibly valid statement.
That makes explicit what the limitations may be.
Trask's earlier statements seemed to aim at much
broader validity, and omitted the limitations.
That is at least one reason why they have seemed
rather circular to a number of us.

*********

Neither Trask's criteria nor any I have suggested
single out expressives for either inclusion or exclusion
(I have always granted that; it would be nice if
Trask would grant it in return).
Yet the point I have made is that Trask's criteria
do so indirectly, because of the bias in written
attestations of expressives, probably world-wide,
and that limits the general validity of any results
he would get from his sample.  Since he has slightly
narrowed his claims, this is of less concern.

However, Trask's criteria also bias against
longer lexical items, and that is not so trivial
by any means.  That is a bias against some canonical
forms, at least statistically, and a bias against
ordinary vocabulary which is not in the highest
frequency class.

Regarding the following point, which I raised only in order to be very careful:

>(2) Words in particular semantic areas
>may be constructed according to
>different rules from other words.

Trask responds:

>This I find unworthy of taking seriously.
>I know of no language in which such
>a thing happens, and I certainly know of
>no reason to suspect that it might be true of Basque.

If one uses the term "semantic areas" rather more broadly
than usual, I think I have indicated that it is true of every
language: there are strong differences between the
semantic ranges occurring in the most frequent,
statistically shortest lexical items, and those occurring
in the rarer, statistically longer lexical items.
Since that application of the phrase certainly may not
have occurred to Trask, I can understand his response.
I would think that in most languages, color terms would
not have different canonical forms from names of animals.
But they might.  I think we agree here.

>Of course, if it *is* true of Basque, then that fact should
>emerge from my investigations.  But I'm not holding my breath.

Well, yes, in a sense, but only when the sample is extended
to include them.  It cannot emerge from a sample which
does not include them, but only from a contrast between
that first sample and a larger sample or another sample.

So again we come back to the point that it is perfectly fine
for Trask to use a very narrow sample as a starting point,
as narrow as he wishes.
But he cannot then draw conclusions of general validity
for the language as a whole, or even, given the probable sample bias,
for the monomorphemic lexical items of early Basque.
Biases in his sample selection almost certainly
work against that.