Sample bias, word length, frequency, circularity

Wed Dec 22 11:01:00 UTC 1999

Lloyd Anderson writes:

> It is well known that more common lexical items
> are, as a statistical matter, shorter,
> and rarer lexical items are, as a statistical matter, longer.
> This is true world-wide, I assume the point is not debated?

> If this is true, then a selection of vocabulary which is very
> strongly biased towards the most common lexical items
> in a language will also be strongly biased towards the shorter items.
> A sample selected with a very strong bias towards shorter
> words will be unrepresentative of the language it is drawn from,
> especially if one is aiming at generalizations about canonical forms.

Not really.  I am interested only in monomorphemic words, and
monomorphemic words tend to be short, while long words tend to be
polymorphemic, in Basque as in all the languages I know anything about.
So a bias toward shorter words mostly excludes polymorphemic words,
which I am leaving out anyway.

Consequently, Lloyd's objection could constitute a problem for me
only in the following scenario:

	Pre-Basque had lots of long monomorphemic words as well as short
	ones, but, for some reason, the long monomorphemic words have been
	generally lost from the language, while the short ones have
	preferentially survived.

And I don't see this as a plausible scenario.

Larry Trask
COGS
University of Sussex
Brighton BN1 9QH
UK

larryt at cogs.susx.ac.uk