Excluding data

Larry Trask larryt at cogs.susx.ac.uk
Tue Aug 24 16:12:53 UTC 1999

On Thu, 19 Aug 1999 ECOLING at aol.com wrote:

> Larry Trask today defended his criteria for what to exclude
> from an initial list of potential proto-Basque vocabulary.
> Let me start by quoting his point that he is not excluding data:

>> Second, nothing is thrown away.  All data remain available for later
>> consideration, after an initial list is obtained.

> Of course, but the "initial" list can bias things mightily.
> Trask quite misunderstands my point about excessive reliance on
> canonical forms (CVCV vs. VCVCV etc. etc.).

But I am not relying *at all* on canonical forms in setting up my list:
I am relying only upon the criteria I cited earlier.  It is my hope --
indeed, my belief -- that canonical forms (morpheme-structure
conditions) will *emerge* from the data.

> The criteria for selecting an "initial" list can bias us in many ways.
> Excluding sound-symbolic words can artificially and circularly
> lead us to expect a much greater conformance to hypothetical
> canonical forms than is in fact the case in most languages.
> Or the reverse, assuming a simpler set of canonical forms
> can promote the exclusion of sound-symbolic words.

But I have *not* excluded sound-symbolic words other than obvious
imitative words like <mu> `moo' and <tu> `spit'.  Please read what I've
written.  I wrote quite explicitly that I was *declining* to exclude
expressive and sound-symbolic formations, *in order to avoid possible
circularity*, but in the confident expectation that such forms would be
excluded anyway by my other criteria -- notably, by their limited
distribution.

> Here is Trask's discussion of this,
> which I hope is evidently circular in form,
> whether or not his /tipi/ ~ /tiki/ is in fact
> descended from Proto-Basque.
> That is, because this sound-expressive word does not conform
> to the canonical form hypothesized for much of the less concrete Basque
> vocabulary, therefore we have an extra argument in favor of
> excluding it (though he does include this one).  But what if
> there are several canonical forms, as in most real-world languages,
> some forms occurring more often in sound-expressive vocabulary?

This is possible, of course.  But, in the Basque case, practically every
item that I would want to regard as sound-symbolic or expressive will be
excluded from my initial list for other, independent, reasons.  The only
one I can think of that won't be is the universal and early-attested
word <tipi> ~ <tiki> `small'.  But this word, though it will make the
initial list, will stand out from all the other words in the list to an
almost wild degree.

> [Demonstrating that sound-symbolic words have a different canonical
> form has little value in arguing that they were not proto-Basque,
> precisely because such differences of canonical form do occur in many
> real languages, so why not also in (proto-)Basque?]

Because the data are overwhelmingly against any such suggestion, in the
Basque case.

> [Expressive and sound-symbolic words are also very much under-recorded
> for many languages and language families.  Many of them are not known
> to learned scholars who are not native users of the languages, because
> they are used in language registers which are never the domain of activity
> of those scholars.

True enough, but this is no more than an argument for doing careful
work.  More particularly, it is an argument for being intimately
familiar with the languages you are working on, instead of extracting
words blindly from somebody else's dictionary.  And anybody who's ever
read three of my postings will know that I wholeheartedly endorse just
this position.

> So this criterion is PARTLY defective or circular.

No, it isn't.  Not at all.  I wish you'd stop describing almost
everything I say as "circular".  ;-)

> There is a partly definitional relation between sound-symbolic and
> narrowness of attestation in recordings.]

Definitional, my censored.  A word is not sound-symbolic because it is
sparsely attested, nor is it not sound-symbolic because it is widely
attested.  Basque <tipi> ~ <tiki> is attested everywhere and at all
periods, and yet I still believe it is sound-symbolic, just like English
`teensy', because of its strange form.  Basque <iraatsi> `carve' is a
hapax, but I don't believe it's sound-symbolic.


>> [Basque <tipi> ~ <tiki>] looks as much like a
>> native and ancient Basque lexical item as `pizza' looks like a native
>> and ancient English word.

> [pizza is not sound-symbolic, and uses odd (for English) spelling,
> but never mind...]

I wasn't suggesting that it was.  If this bothers you, try English `zap'
instead.  This look to you like a native and ancient English word?
You expect to find a verb <zappian> in the Old English Bible?

> To me, this is a clear indication not of something wrong with including
> the word /tipi/ ~ /tiki/ as potentially proto-Basque,
> but something inconsistent in the total set of criteria,
> UNDER THE CONDITIONS that we are trying
> to force a consistent canonical form onto our hypothesized "initial list"
> of potential proto-Basque vocabulary.  That goal may be wrong-headed.

Once again, you are accusing me of something I haven't done.

I am *not* trying to "force a consistent canonical form" onto the words
in my initial list.  I am merely proposing to set up an initial list, by
my criteria, in order to see what emerges.  Are my postings not written
in English, or what?  ;-)

> The criteria, and how they are used, are themselves JUST A SET
> OF TOOLS, and those tools, and how they are in fact used,
> need themselves to be evaluated to see whether they lead to errors.
> The exclusion of a sound-symbolic word on the grounds that it
> has a different canonical form from other vocabulary is a clear error,
> given the factual knowledge that such sound-symbolic forms
> do often (world-wide) have different limitations on their phonological
> forms.

But I'm *not* excluding the damn word because it's sound-symbolic.  The
word <tipi> ~ <tiki> meets all my criteria, and so it will appear in
the initial list.  Didn't I say that?  But, within that list, it will
stand out *com' una casa*, as they say in Spanish.


>> And the primary object is
>> to exclude the words which should be excluded, not to
>> include every single word which should be included.

> It is not one or the other, it is both.

No.  This is not possible.  We have to choose one or the other, and I
choose the first.  *Once* we have excluded the words which must be
excluded, *then* we can turn our attention to seeking out words which
have been provisionally, but wrongly, excluded.  But we can't do both at
once.

I confess at once that a number of genuinely native and ancient Basque
words will be excluded from my initial list, for various reasons but
mostly from limited distribution.  But they can be picked up later.

> Any tool for achieving this can make either sort of error,
> errors of inclusion or errors of exclusion.  It is not simple.
> The mistake here, in my view, is very much akin to the mistake
> of rushing to judgement discussed in another message today,
> when we are dealing with more complex situations of
> provisional hypotheses.

I am not "rushing to judgement".  I am instead proceeding as cautiously
and as prudently as possible.

> The various criteria for what is a proto-Basque word
> interact in complex and sometimes contradictory ways.
> There is no point in hiding these difficulties to render a "final"
> decision, even if it is claimed to be only an "initial" list.

When I write "initial", I mean "initial", and I definitely do not mean
"final".  Do these words have different meanings in your part of the
world?

And I am not "hiding" any difficulties at all.  Quite the contrary: I am
doing my level best to recognize the difficulties and to address them:
hence my criteria.

> I think Trask's procedure here is rather unlike that of most
> comparativists,

But I'm not *doing* comparison.  I'm doing morpheme structure in an
unrecorded but substantially reconstructed language.

> in that they usually LABEL which languages or dialects a form occurs
> in,

Yes, and so do we.  And a word found nowhere but in one or two dialects
is not going to make it into my initial list -- though it *might* get
into an expanded later list.

>>> But there is every reason to USE every one of the criteria
>>> (those which Larry Trask expressed as cutoffs),
>>> to use them as LABELS on the vocabulary items,
>>> which in a computer database can be taken into consideration
>>> whenever any question is asked of the database.

>> Sure, but that's a different exercise.

> Not a different exercise at all, the same one, simply not
> throwing away data or making it hard to access or reconsider.

> The ability to have all of the data available in a database,
> tagged and labeled according to all of the criteria Trask mentions,
> differs from Trask's procedure primarily in not having to render
> final judgements prematurely, because we can always go back and
> re-weight the criteria, look at them again from a different perspective.
> Without such databases, decisions are made once and cannot be
> easily reconsidered later with the benefit of full information.

Sorry, but this makes no sense to me.

I am compiling a database.  For my purposes, it matters not at all
whether that database takes the form of a tagged corpus on line or
annotated notes on sheets of paper.  On-line databases are easier to
manipulate, that's all.  But, once the database is in place, nothing at
all happens until we decide to do something with it.  I have already
explained what I propose to do with mine.  And, whether on line or on
paper, I still have to choose my criteria and apply them, now don't I?

> So, once again, let's keep the data available,
> analyze it with all the care Trask obviously can muster,
> but not hide the structure of the analysis,
> nor make it needlessly difficult to go back and reconsider particular
> decisions or whole groups of decisions at a later time.

Well, I'm becoming exasperated.  If you think you can do a better job of
elucidating Pre-Basque morpheme-structure conditions than I can, feel
free.

Larry Trask
University of Sussex
Brighton BN1 9QH

larryt at cogs.susx.ac.uk
