Sample bias, word length, frequency, circularity

Larry Trask larryt at cogs.susx.ac.uk
Tue Dec 21 12:38:20 UTC 1999


Lloyd Anderson writes:

> This message is on a distinct source of sample bias,
> one not in these discussions previously.
> And on one explanation of
> what was meant by "circularity" earlier.

> It is well known that more common lexical items
> are, as a statistical matter, shorter,
> and rarer lexical items are, as a statistical matter, longer.
> This is true world-wide, I assume the point is not debated?

I know of no evidence against this, and I know of some hard evidence to
support it.
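
For anyone who wants to check the relation on a frequency list of their own, here is a minimal sketch. It is only an illustration under stated assumptions: the function name `length_bias` is made up, the toy entries are placeholder strings and not real words or real counts, and length in characters is only a rough stand-in for length in segments or syllables.

```python
from statistics import mean

def length_bias(freq, k):
    """Compare the k most frequent items with the whole list.

    freq maps each lexical item to its frequency count.
    Returns (mean length of the top-k items, mean length of all items).
    Length is counted in characters, which is only a proxy.
    """
    if not 0 < k <= len(freq):
        raise ValueError("k must be between 1 and the number of items")
    ranked = sorted(freq, key=freq.get, reverse=True)
    return mean(len(w) for w in ranked[:k]), mean(len(w) for w in ranked)

# Placeholder data, invented purely to exercise the function.
toy = {"aa": 100, "bbb": 50, "cccccc": 5, "dddddddd": 1}
top_mean, all_mean = length_bias(toy, 2)
print(top_mean, all_mean)
```

If the first number comes out clearly smaller than the second on real data, a sample drawn from the most frequent items is skewed towards shorter forms to roughly that degree.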

> If this is true, then a selection of vocabulary which is very
> strongly biased towards the most common lexical items
> in a language will also be strongly biased towards the shorter items.
> A sample selected with a very strong bias towards shorter
> words will be unrepresentative of the language it is drawn from,
> especially if one is aiming at generalizations about canonical forms.

Perhaps, but recall that I am not interested in finding canonical forms
for word-forms.  I am only interested in finding morpheme-structure
conditions for monomorphemic lexical items.  Hence polymorphemic
word-forms are of no interest or relevance -- and most polysyllabic
word-forms are also polymorphemic.

> (It will also be skewed towards more grammatical lexical items,
> "when, the, very, some, that, do".)

Not really, in my case.  Basque makes much heavier use of grammatical
affixes than does English, and the Basque equivalents of such English
words as 'when', 'the', 'if', 'to', 'with', 'of', 'in', 'toward',
'from', 'for', and 'because' are bound affixes, and hence will not
qualify for my list.  And a number of others are bimorphemic, such as
those meaning 'although', 'since' (causal), 'after', 'before', and
'together', and these too will not make my list.  The number of
monomorphemic grammatical words is really rather small.

> This matter of a bias in word length
> is another factor which I did not previously mention explicitly,
> which argues against having too small a list of items included
> in a sample, if one wishes to draw conclusions of general validity,
> when one is trying to determine the canonical forms
> even for monomorphemic lexical items in early Basque
> (or any other language),
> and even if (as Trask very nearly specifies today),
> one is not interested initially in canonical forms of expressives.

> Multisyllabic forms which actually occurred
> in spoken early Basque are of course monomorphemic
> if they are not analyzable into components within Basque.
> Whether or not they will be picked up by a particular
> sampling technique -- that is one of our questions.

Any such Pre-Basque words which still exist today will, of course, be
picked up if they meet my criteria, and not otherwise.  But the vast
majority of three-syllable words in Basque, and probably all longer
words, are transparently either polymorphemic or borrowed.  There is
nothing surprising about this: the same is true for English.

> So the discussion in another message I have sent today,
> under the title "Excludes much", is highly relevant to the
> question of possible distortion of a sample of vocabulary.

> That other message contrasts roughly basic vocabulary
> of ordinary life on land and sea with
> the kind of basic vocabulary which is of highest frequency,
> independent of subject matter of texts,
> which will therefore be most likely to be found
> in at least four out of five dialect groups,
> following Trask's criteria for inclusion.

But we can't decide in advance which words "ought to" be indigenous and
monomorphemic.  That's a matter for empirical investigation.

> On other matters,
> I think we have reached some partial closure on one point.

> Trask has written:

>> But who ever said I was interested in canonical forms
>> "for the language as a whole"?

> I believe the original statements did not specify exclusions,
> and aimed at general validity, so one could reasonably
> assume they were intended for the language as a whole.

No: native, ancient and monomorphemic lexical items.  That's all.
I think I've been pretty explicit about this.

> The exclusion of expressives,
> *or systematically of any other group of words*
> (such as the longer words, as noted above),
> through any aspect of the sampling procedure,
> would of course tend to invalidate such general validity.

No; not at all.  Polymorphemic words are excluded by definition: they are
not relevant to my task.  Obviously expressive words hardly ever satisfy
my criteria, from which the most appropriate conclusion appears to be
that *these particular words* are not ancient.  Of course, Pre-Basque
doubtless possessed *some* expressive words, but there is no evidence to
support a claim that these were identical to the modern ones.  Such
evidence as we have suggests that expressive formations in Basque have
been subject to constant renewal and replacement.

> Trask has affirmed:

>> Expressive formations may be constructed
>> according to different rules from ordinary words.

>> Indeed they are, but how can I hope to establish
>> this objectively unless I
>> *first* identify the rules for ordinary words?

> The methodological point would be I think that
> one cannot separate out ordinary words from
> expressives (or from any other category of words)
> in one's selection of data, without knowing how to
> recognize what expressives are.

Exactly.

> But as I have previously stated, as long as the
> expressives are not excluded as "not candidates
> for early Basque", they will be considered at some point.

I repeat: *all* words will be considered from the beginning, evaluated
according to my criteria, and included if and only if they satisfy those
criteria.
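
That procedure can be sketched roughly as follows. This is a hedged illustration, not the actual working method: the field names, the `Candidate` record and the `qualifies` function are hypothetical, and only the criteria named in this message (native, monomorphemic, found in at least four of five dialect groups) are modelled. How "ancient" is established is not spelled out here, so it is left out.

```python
from dataclasses import dataclass

# Threshold taken from the "four out of five dialect groups" wording above.
MIN_DIALECT_GROUPS = 4

@dataclass(frozen=True)
class Candidate:
    form: str
    monomorphemic: bool       # hypothetical annotation
    borrowed: bool            # hypothetical annotation
    dialect_groups: int       # how many of the five groups attest the word
    expressive: bool = False  # recorded, but never consulted by the test

def qualifies(c: Candidate) -> bool:
    """Apply the same test to every word; no category gets special handling."""
    return (c.monomorphemic
            and not c.borrowed
            and c.dialect_groups >= MIN_DIALECT_GROUPS)

def select(candidates):
    """Return the forms that satisfy all criteria, in input order."""
    return [c.form for c in candidates if qualifies(c)]

# Placeholder entries, not real Basque words.
words = [
    Candidate("w1", True, False, 5),
    Candidate("w2", False, False, 5),                  # polymorphemic
    Candidate("w3", True, True, 5),                    # borrowed
    Candidate("w4", True, False, 3),                   # too few dialect groups
    Candidate("w5", True, False, 4, expressive=True),  # expressive, but qualifies
]
print(select(words))
```

The `expressive` flag is carried along but deliberately ignored by `qualifies`: an expressive word gets in or stays out on exactly the same grounds as any other word.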

> It is the phrasing which Trask used early in these discussions,
> which seems to aim at conclusions valid for "early Basque",
> and linked these with his criteria,
> rather than saying that his conclusions would be
> valid for "a subset of early Basque",
> namely that subset which is selected by his criteria,

Careful!  I cannot directly examine Pre-Basque at all.  If I could, I
would simply do so.

Instead, I must first attempt to compile a list of the best candidates
for Pre-Basque status.  Of course, I can't hope to recover the entire
language, in all its detail.  But, if my objective criteria lead to a
set of several hundred best candidates which conform strongly to certain
generalizations, then I think we have the basis of some conclusions.
And that is the point of the exercise.

> which seems to place his criteria above
> themselves being questioned.

Er -- what?

> Perhaps this way of stating it makes it clearer
> what has been meant by "circularity".

Not to me, I'm afraid.

> If we say that
> conclusions from analyzing
> a particular subset of early Basque will
> be valid for the subset of early Basque
> selected by the criteria used to select it,

> we of course have a plausibly valid statement.
> That makes explicit what the limitations may be.
> Trask's earlier statements seemed to aim at much
> broader validity, and omitted the limitations.
> That is at least one reason why they have seemed
> rather circular to a number of us.

I don't see any circularity.

Anyway, I am still asking Lloyd for his set of alternative criteria,
ones which he thinks are better than mine.  *Explicit* criteria, I mean,
not generalities.

Lloyd, should Basque <tutur> 'crest' be included in my list or not?  On
the basis of what criteria should the question be answered?

How about an answer?

> Neither Trask's criteria nor any I have suggested
> single out expressives for either inclusion or exclusion
> (I have always granted that; it would be nice if
> Trask would grant it in return).

I would be happy to grant this if Lloyd would only say it explicitly.
But, frequently, Lloyd has appeared to say that some kind of special
provision should be made for expressive formations.

So, Lloyd, are you now agreeing to the following?

	Expressive formations should be subject to no special treatment
	at all, but must be treated just like all other words, according
	to exactly the same criteria, whatever those are.

Yes or no?

> Yet the point I have made is that Trask's criteria
> do so indirectly, because of the bias in written
> attestations of expressives, probably world-wide,
> and that limits the general validity of any results
> he would get from his sample.  Since he has slightly
> narrowed his claims, this is of less concern.

"Narrowed my claims"?  How?  What claims?  I'm still doing just what I
said I was doing at the beginning.

> However, Trask's criteria also bias against
> longer lexical items, and that is not so trivial
> by any means.  That is a bias against some canonical
> forms, at least statistically, and a bias against
> ordinary vocabulary which is not in the highest
> frequency class.

Again: long words are hardly ever monomorphemic.  Rare words, even if
monomorphemic, are unlikely to survive for 2000 years in an unwritten
language.

Whoops -- gotta go.  Apologies for cutting this short.

Larry Trask
COGS
University of Sussex
Brighton BN1 9QH
UK

larryt at cogs.susx.ac.uk


