UP and lexicon size

Robert Whiting whiting at cc.helsinki.fi
Sun Jul 29 13:19:09 UTC 2001


On Mon, 21 May 2001 Douglas G Kilday <acnasvers at hotmail.com> wrote:

>Robert Whiting (17 May 2001) wrote:

>>In fact there is a sort of Zipf's Law for this because function
>>words are relatively few in number but are used very frequently
>>while the number of content words is huge (and growing
>>constantly) but the individual words are much rarer in use.

>Excuse my skepticism, but I can't believe the _net_ number of
>contentives in a given language is growing constantly. As new
>contentives enter a language, others exit. Lexica don't have
>rubber walls.

And I (Robert Whiting) would have replied:
[I say would have replied because I already had this written as
part of a larger response that I had not had time to finish]

Sure they do.  Ask any lexicographer.  In fact, you don't have to
ask a lexicographer; they will tell you without asking unless you
can figure out a way to stop them.  It is known locally as the
"lexicographer's lament."  Here is a quotation from Frederick C.
Mish, Editor in Chief of Merriam-Webster's Collegiate Dictionary
Tenth Edition (2000), taken from the preface (p. 6a):

   The ever-expanding vocabulary of our language exerts
   inexorable pressure on the contents of any dictionary.  Words
   and senses are born at a far faster rate than that at which
   they die out.

He then goes on to back up this statement with statistical data
of the kind that a professional lexicographer in his position
can easily access.

So lexica do have rubber walls.  The lexicon of any language
expands to allow its speakers to talk about whatever they want or
need to talk about.  But I see from your subsequent postings
that you consider all lexicographers to be pathological liars
who are just in league with their publishers to get people to
believe that there are so many new words in the most recent
dictionary that everyone has to buy a copy.  But then you go on
to disprove your own point by establishing that it is easy to
see when a new word enters the language, but very difficult to
establish when an old word leaves.  Therefore, by your own
reasoning, new words and senses enter the language faster than
they die out, if only because you can never be sure that the old
words are really gone.

What makes the difference, of course, is writing.  If you don't
have writing, once a word is gone, it is gone.  If it isn't used
for three generations, no one knows it, or even that it once
existed (unless it is preserved in a compound that continues in
existence).  Without writing, there is no record of the word so
it can't be revived.  Writing simply provides external storage
for the lexicon of the language.  Without writing, there is no
place to store the lexicon except in the memory of speakers.
Speakers know the words that they know, and that's it.  If they
hear an unfamiliar word, they can't look it up in a dictionary --
they can only ask someone else what it means or try to guess by
analogy or from context.

Now it is quite possible that internal storage is limited.  That
is, if you want to learn a new word you have to forget some word
that you already know.  Since you claim that this is the way that
lexica work, your mind would seem to work this way.  If RAM is
getting full, you will have to write something to the hard disk
before you can put something else in.  If your hard disk is
getting full, you have to delete something before you can add a
new file.  But, by analogy with writing, it is possible to
transfer the old file that you don't need at the moment to
external storage by copying it to a diskette before deleting it.
Then if you find that you need it later, you can always recover
it from the diskette (provided you can remember what diskette it
is on and where you put the diskette).  So there is a difference
between the individual's lexicon and the lexicon of the language.

The individual's lexicon may well be limited by internal storage
(and the amount of internal storage may well vary by individual).
There are different kinds of vocabulary, which require different
kinds of storage.  Active vocabulary consists of words that the
individual uses in his speech production.  This must be kept in
the equivalent of random access memory (constantly available).
Then there is passive vocabulary.  This consists of words that
the individual recognizes and understands but that he does not
use in his own speech production.  This can be stored in a
different location not requiring constant immediate access (on
the hard disk, as it were). Finally, there is occasional
vocabulary, consisting of rare, archaic, specialized, and even
obsolete words, that can be kept in external storage (on a
diskette = in a dictionary) and can be accessed in case of need
(say, to read Shakespeare or Marlowe, or in case one wants to
take up the study of medieval armor, collect coins, or try
falconry, and so on).
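
To make the analogy concrete, here is a minimal Python sketch --
purely illustrative, not anything from the linguistic literature;
the tier names and sample words are invented placeholders -- of a
speaker's word lookup falling through the three kinds of
vocabulary:

   # Illustrative model of the three vocabulary tiers described
   # above.  The sets and sample words are hypothetical.
   ACTIVE = {"word", "speak", "know"}      # "RAM": used in one's own speech
   PASSIVE = {"lexicon", "archaic"}        # "hard disk": recognized, not used
   OCCASIONAL = {"falchion": "a broad, curved medieval sword"}  # "diskette"/dictionary

   def look_up(word):
       """Report where a speaker finds a word, falling through the tiers."""
       if word in ACTIVE:
           return "active vocabulary (immediate recall)"
       if word in PASSIVE:
           return "passive vocabulary (recognized but not used)"
       if word in OCCASIONAL:
           return "occasional vocabulary (looked up): " + OCCASIONAL[word]
       return "unknown -- ask someone or guess from context"

   print(look_up("falchion"))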

Now this analogy is not precise, because we know exactly how
much storage capacity our RAM, hard disks, and diskettes have
and how information is stored there, but we don't know much
about how things are stored in the human mind or what its
storage capacity is.  Perhaps the human mind does work like a
computer -- but since computers are not capable of the same
things that the human mind is, there is reason to doubt this.
Developing a computer that works like the human mind is the goal
of the AI people (and I wish them luck).  The human mind is
capable of intelligence (well, some human minds are), that is, it
can analyze data and reach a conclusion that it has not been
previously programmed with.

But to get back to the size of the lexicon, it doesn't really
matter what one believes about it.  Professional lexicographers
say that the lexicon of English is constantly expanding;
semi-naive native speakers say that the size of the lexicon is
constant.  Which one are you going to believe?  (I know where I'd
put my money.)  But as I say, it doesn't really matter what you
believe.  Some people believe that the universe is constantly
expanding; some believe that it is in a steady state.  What
people believe about it doesn't affect the way it works.  It will
continue to work the way it does, regardless.  So it is with the
lexicon of a language.  It will work the way it does regardless
of what people believe about it.

There will always be words to let a language's speakers talk
about whatever they need or want to talk about.  If there is no
word
for something that they need or want to talk about, let's say
computers and computer applications, then they will create one:
by borrowing one from another language (offhand, I can't think
of any computer terminology that has been borrowed from other
languages; since most computer development was done by English
speakers, the terminology has generally been exported to other
languages); or by using an old word with a new sense (mouse,
disk, bug, virus); or by new compounds (software, internet,
floppy disk); or by abbreviation (RAM, ROM, DOS, modem, univac,
awk, perl); or by free invention or neologisms (glitch,
ergonomics); or by expropriating personal names (baud, Turing
machine); etc., etc.

The top-end dictionaries of computer terminology claim to have
around 13,000-15,000 entries.  Even allowing for a reasonable
amount of exaggeration, double-counting, etc., I think we could
safely assume around 10,000 words, expressions, and senses that
have been added to the language in the last half-century in this
one field alone.  So to make your claim reasonable, you will have
to come up with a list of 10,000 words, expressions, and
senses that have been lost from the English lexicon in the past
50 years.  And once you find your first 10,000 lost words, then
we can look at things like aeronautics, microbiology, and nuclear
physics to see how many more you need to keep up.

So I'll excuse your skepticism.  Skepticism is a healthy way to
look at things.  But if you want to be known as a wise man rather
than just a skeptic, you would do well to have evidence to back
up your intuitions and beliefs.

[Anyway, that's what I would have said, had not a number of other
people, such as Larry Trask and Jim Rader, pointed out the
fallacies in the reasoning that led to your conclusion, posted on
Monday, May 28, 2001, that "All things considered, net stasis of
lexical size makes more sense than continuous expansion."

But actually, neither one makes particular sense.  There is
nothing that I can think of that *requires* continuous expansion
of a language's lexicon.  The factor that controls lexicon size
is the number of things (objects, phenomena, and processes), real
or imagined, that the speakers of a language need or want to talk
about.  If this number has net growth, then the size of the
lexicon will increase; if it declines, the lexicon will shrink.
If it is constant, the lexicon will tend to stay about the same
size.  It has simply been observed by lexicographers (and some
linguists) that, historically, the lexicon of English has been
constantly expanding.

On the other hand, if net stasis of lexical size is a linguistic
*requirement*, then it must be a linguistic universal.  Even
among the most fervent seekers after linguistic universals I have
never seen "net stasis of lexical size" proposed as one.
Furthermore, such a requirement raises immediate questions that I
have never seen addressed.  Such questions include:

  Is the fixed size of the lexicon the same for all languages?

  If so, what is the size of the lexicon of every language?

  If not, how is the fixed size of the lexicon for a given
  language determined?

  What factors enter into this determination?

  Are all these factors linguistic, or can they be socially or
  culturally determined?

  How is it determined when the lexicon is full and some words
  must be removed?

  What happens if the required maximum size of the lexicon is
  inadvertently exceeded?  Do the speakers get a certain amount
  of time to remove the excess words or does the system crash?
  What happens to speakers who don't know that the allowed size
  of the lexicon has been exceeded?

When you have answers to questions like these, it might be
possible to consider "net stasis of lexical size" seriously.  Of
course, if the answer to the last question is "nothing", then
there is no basis for the concept because it is undetectable.
That is to say, the *requirement* can't be taken seriously
because there is no penalty for ignoring it.]


But then, on Sat, 14 Jul 2001 Douglas G Kilday
<acnasvers at hotmail.com> wrote Re: Uniformitarian Principle:

<snip>

>Several list-members have invoked a linguistic UP, usually
>without any clear statement, and not always consistently. Larry
>Trask has been the staunchest advocate of the UP on this list,
>yet he has attacked the principle of net lexical stasis,
>apparently believing that the inventory of contentives in a
>language grows continuously (as claimed by Robert Whiting).

First, as I said above, it's not particularly my claim.  Ask any
lexicographer of English.  I just happen to think that it is
obvious that the lexicon of a language can expand to any size
that its speakers want or need.  The lexicon of the individual
may be limited in size, but not that of the language.
Fortunately, not all speakers of the language want or need to
talk about all of the same things.  No speaker can know the
entire vocabulary of the language, but he doesn't have to.  He
needs a core vocabulary to use for general communication purposes
and if he wants to communicate in a specialized field then he
needs additional vocabulary particular to that field.  But the
lexicon of the language contains not only the core vocabulary,
but also *all* of the specialized vocabularies.  The core
vocabulary may well be more or less static (this was the
assumption of the glottochronologists -- but that didn't work
either), but then again it may not be.

Second, the UP doesn't restrict what happens (events), but rather
how things can happen (processes).  Lexicon size is not a
process, so the UP doesn't have anything to say about it.  What
the UP says is that the processes by which the size of the
lexicon changes will always have been the same.  Thus we can
expect that languages have always been able to add words by
borrowing, by coining new words (free invention), by compounding,
by analogy, by reanalysis, etc.  That is, we should not postulate
any mode of word formation in prehistory that cannot be observed
today.  However, saying that the UP requires the lexicon of PIE
to have been the same size as the lexicon of modern English is a
non sequitur worthy of someone with considerably less linguistic
training.

>Now, if the UP and LT's view of lexical growth are both correct,
>the alleged present inflationary situation has _always_
>characterized languages, all of which are therefore, in
>principle, traceable back to a single word. (So was that word
>/N/, /?@N/, /tik/, or /bekos/? Never mind ... rhetorical
>question.)

Non sequitur.  The UP doesn't have anything to say about the size
of lexicons, or of any other population.  Population size is not
a process but a summation over events (births and deaths).  If
total births exceed total deaths then the population is growing;
if deaths exceed births then it is declining; if they are equal
then it is static.  What the UP says is that the methods of birth
and death will have always been much the same, not that the
population size has always been constant.
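
Purely as an arithmetical illustration (my own, not anything
from the list discussion): the net size of any such population
is just a running sum over its events, and nothing in that sum
is constrained by the UP.  A trivial Python sketch, with
invented figures:

   # Lexicon size as a running sum over events.  All figures are
   # hypothetical placeholders.
   events = [
       (120, 80),   # (words coined, words lost) in some period
       (95, 110),
       (200, 60),
   ]

   size = 50000  # hypothetical starting lexicon size
   for born, died in events:
       size += born - died
   print(size)  # grows, shrinks, or stays put depending only on the balance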

>For the purposes of this list, I would suggest a statement of
>the UP roughly as follows: "The history and prehistory of
>languages used by anatomically modern humans involve no
>fundamental processes not occurring today."

Adequate, if somewhat wordy.  My professor always expressed it
as:  "Anything that happens later could have happened earlier."
What this means is that the processes that have led to an
observed event (even if that event is unique) could have led to
that event when it wasn't observed because the processes don't
change -- they have always been there.  Note the use of "could
have happened."  This does not mean that such an event "must have
happened."  This limits the value of historical parallels.
Historical parallels only show that a particular event could have
happened at some other time, not that it must have happened.
Showing that it must have happened requires a different kind of
evidence.

For example, the observed breakup of Latin into the Romance
languages does not prove that there was a PIE language that broke
up into the modern IE languages.  It just proves that there could
have been such a language and such a breakup.  But the fact
that a proto-language (Proto-Romance) matching the known parent
(Latin) in detail can be reconstructed from the Romance
languages, and that the same methods yield a detailed
proto-language for the modern IE languages, makes it much more
convincing that there was a PIE language that broke up into the
modern IE languages in the same way that Latin broke up into the
modern Romance languages.

By contrast, for the Altaic languages, the inability to
reconstruct a convincing proto-language for this group means that
the historical parallel of the breakup of Latin into the Romance
languages is not available as evidence of the existence of an
Altaic language family.  Some other mechanism must be invoked to
account for the (mostly typological) similarities of this group
of languages.

>This leaves rather vague the matter of what "fundamental
>processes" are, like "physical forces" in geology. Specific
>examples of results might well be unique, so the UP is _not_
>equivalent to the rather crude synchronic typological arguments
>often encountered in reconstructive debates. It deals with
>dynamics, not statics.

Ah, then you actually realize that the UP has nothing to do with
a static lexicon size.

>An example to which the UP might be applied is the proposal that
>pre-PIE had ergative-absolutive case-marking. If the proponents
>can give clear examples of E-A languages turning into
>nominative-accusative languages during historical times, and in
>the process today, then the proposal is credible. If OTOH the
>record, and current behavior, show that E-A case-marking tends to
>develop out of N-A structure, then the ergative pre-PIE
>hypothesis is in trouble.

No, this is a misuse of negative evidence and the UP.  All the UP
says is that if historical examples of E-A > N-A can be shown
then E-A > N-A could have happened in pre-PIE.  It does not say
that it must have happened, nor does it say that N-A > E-A
cannot happen, nor does it imply that either E-A > N-A or
N-A > E-A has to happen.  Now admittedly, if there are hundreds
of examples of
N-A > E-A and none of E-A > N-A, that makes E-A > N-A in pre-PIE
a much more difficult row to hoe, but it still doesn't make it
impossible.  It just means that convincing evidence has to come
from elsewhere.

This is exactly like Dr. David L. White's contention that finite
verbal morphology can't be borrowed, based on hundreds of examples
of the borrowing of nominal morphology and no clear examples of
the borrowing of finite verbal morphology.  This may make for a
strong presumption -- a heuristic -- but is not convincing.
Evidence has to come from someplace else.  But his contention
that verbs are higher up in the food chain than nouns and eat
nouns for breakfast is not evidence; it is just a plausible
story.  And, a priori, it is not even particularly plausible,
since there is no a priori reason that speakers should view
nouns and verbs with different levels of awe.  If it is true,
then it must be part of the fabled Universal Grammar, something
that every human being is born knowing.  I'm not saying that he
should
give up his idea (just like people shouldn't give up working on
Universal Grammar).  If he keeps working on it and succeeds in
proving that finite verbal morphology can't be borrowed, he may
someday be canonized by the Chomskians as one of the first people
to identify a concrete feature of Universal Grammar.

But to see if you have grasped the point about negative evidence,
here is a multiple choice historical question to test your
comprehension:

  There is no evidence that Richard Nixon was involved in the
  Watergate conspiracy.  This means that:

  a) Nixon was not involved in the Watergate conspiracy.

  b) All evidence that Nixon was involved in the Watergate
     conspiracy has been lost, destroyed, or suppressed.

  c) Either a or b could be true.

(Never mind ... rhetorical question. :>)

Bob Whiting
whiting at cc.helsinki.fi


