[An-lang] On the size of the lexicon in preliterate languages

Andy Pawley apawley at coombs.anu.edu.au
Wed Jun 11 03:50:14 UTC 2003

Dear Jim (if I may)

I understand your concern to be getting an idea of the size of the
'indigenous' lexicon in languages of preliterate societies.

I can tell you something about estimates based on the better
dictionaries for 'preliterate' languages of the Austronesian family
and the Trans New Guinea (the largest Papuan) family.

But first some methodological considerations. We can't make useful
comparisons without agreeing on the basic units to be counted.
Defining terms such as 'lexical unit' and 'lexeme' is, as you
indicate, crucial to estimating the size of the lexicon.

Like D.A. Cruse in his book Lexical Semantics, I regard the basic
lexical unit as the pairing of a form with a single sense.  Just
counting 'lexical entries' or 'headwords' is highly unsatisfactory --
different dictionaries may organise entries on radically different
principles so that counts of entries or headwords will not be
commensurate. A polysemous root like run, take or head consists of
many sense units and each such unit has to be learnt separately.  A
family of sense units forms a lexeme. One can in turn recognise a
family of lexemes (related by derivation, compounding, etc.) which
some dictionaries will include in a single entry and others will not.
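
The three levels of counting can be made concrete with a small sketch.
It is only an illustration: the class names (SenseUnit, Lexeme), the
helper functions and the toy data for 'run' are assumptions invented
here, not taken from any dictionary mentioned in this note.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class SenseUnit:
    """The basic lexical unit: one form paired with one sense."""
    form: str
    sense: str


@dataclass
class Lexeme:
    """A family of related sense units that share a form."""
    form: str
    senses: List[str] = field(default_factory=list)

    def sense_units(self) -> List[SenseUnit]:
        return [SenseUnit(self.form, s) for s in self.senses]


# Toy data, invented for illustration only: one family of lexemes
# related by derivation and compounding.
family = [
    Lexeme("run", ["move fast on foot", "operate (a machine)", "flow (of liquid)"]),
    Lexeme("runner", ["person who runs", "blade of a sled"]),
    Lexeme("run-up", ["approach before a jump"]),
]


def count_lexemes(lexemes: List[Lexeme]) -> int:
    return len(lexemes)


def count_sense_units(lexemes: List[Lexeme]) -> int:
    return sum(len(lex.sense_units()) for lex in lexemes)


# Two hypothetical dictionaries organise the same material differently:
# A puts the whole family under one headword, B gives each lexeme an entry.
entries_dictionary_a = 1
entries_dictionary_b = len(family)

print("entries (A):", entries_dictionary_a)         # 1
print("entries (B):", entries_dictionary_b)         # 3
print("lexemes:", count_lexemes(family))            # 3
print("sense units:", count_sense_units(family))    # 6
```

The same material gives 1 or 3 entries depending on the dictionary's
layout, but 3 lexemes and 6 sense units either way, which is why entry
counts from different dictionaries are not commensurate.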

Given that the 10 most polysemous verb roots in English total 552
senses between them in the Macquarie Dictionary (many more in the
OED, but that includes obsolete senses), and the top 200 verb roots
total over 3000 senses, you can see that a count of sense units will
yield a much larger lexicon than a count of lexemes.
Comparison is further complicated by the fact that different
languages seem to have different amounts of polysemy. (It is true
that there is some fuzziness in boundaries between sense units but
there are tests for polysemy that work most of the time.)
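
The arithmetic behind those figures can be spelled out as follows. This
is a rough sketch using only the two totals quoted above; since the
top-200 total is given as "over 3000", the values derived from it are
lower bounds, and the variable names are illustrative.

```python
# Figures quoted above for English verb roots in the Macquarie Dictionary.
top10_roots = 10
top10_senses = 552
top200_roots = 200
top200_senses_min = 3000  # "over 3000", so treated as a lower bound

# Average number of sense units per root.
avg_top10 = top10_senses / top10_roots
avg_top200_min = top200_senses_min / top200_roots

# Lower bound on the average for the 190 roots ranked 11 to 200.
avg_rest_min = (top200_senses_min - top10_senses) / (top200_roots - top10_roots)

print(f"top 10 roots: {avg_top10:.1f} senses per root")
print(f"top 200 roots: at least {avg_top200_min:.1f} senses per root")
print(f"roots 11-200: at least {avg_rest_min:.1f} senses per root")
```

On these figures the ten most polysemous roots average about 55 senses
each, and the top 200 average at least 15, so for such roots a
sense-unit count exceeds a lexeme count by more than an order of
magnitude.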

There are other considerations. Just counting single-word lexical
units will result in an estimate that is far too low. In most,
probably all, languages much of the lexicon consists of compounds and
phrasal units.  Estimating the size of the multi-word lexicon as
opposed to the single-word lexicon can't be done by a simple general
formula because languages vary considerably in how much use they
make of compounding and phrasal units.

Defining the boundary between inflection and derivation and whether
to count inflected forms is another issue. I think most of us agree
that we should not count regular inflected forms but we should count
irregular ones.  Another variable is the treatment of dialect
variants. Some dictionaries represent a single regional dialect,
others include material from a number of dialects.  And so on.

Anyway, my own experience of attempting to compile comprehensive
dictionaries is limited to one Austronesian language (Wayan Fijian)
and one Trans New Guinea language (Kalam). I've been toiling at both
for over 30 years, off and on.

Wayan is a dialect of the Western Fijian language spoken by a farming
and fishing community of about 1500 people.  The Wayan-English
dictionary (1000 pages) contains around 35,000 sense units, of which
probably not more than 3 percent would be loanwords from non-Fijian
languages. I haven't done a sampling of lexemes but at a guess there
are around 20,000 to 25,000. For sure, I have missed many thousands of
multiword units and probably some thousands of derived words, as well
as many foreign words and phrases that are more or less integrated
into Wayans' speech repertoires.

Kalam is spoken by a farming people on the fringes of the New Guinea
Highlands. At first European contact (in the 1950s and 60s) there
were about 13,000 Kalam, though these divided into several regional
dialects.  The Kalam-English dictionary is smaller than the Wayan
one, containing about 15,000 sense units. Why is it smaller? Mainly I
think because Kalam doesn't have such a rich verbal derivational
system as Wayan and because, unlike Wayan, it cannot derive verb
roots from nouns and vice versa.

In her 1998 PhD thesis on problems in Tongan lexicography Melenaite
Taumoefolau made counts of the number of entries in the largest
dictionaries of Polynesian languages (Maori, Hawaiian, Tongan,
Samoan). As I recall it, these ranged from 19,000 to 23,000. These
figures don't tell us the number of basic lexical units (in my sense)
but they indicate that these four dictionaries each probably contain
on the order of 30,000 to 50,000 lexical units.

All of which suggests that your historical linguist friends who said
50,000 were talking more sense (no pun intended) than those talking

Of some interest are the inventories for specialised semantic
domains. Kalam has over 1200 terms for plant taxa, Wayan has 600-700.
The Kalam have a richer flora (Waya is a small island) and make wider
use of it than contemporary Wayans, who are more westernised.
Comparative ethnobotanical data indicate that preliterate language
communities generally have over 1000 terms for plants, provided they
live in a place with a rich flora.  The Wayans exploit a rich marine
environment and distinguish over 400 fish taxa, 140 mollusc taxa and
about 40 crustacean taxa. Other studies show that Pacific Island
fishing communities consistently distinguish well over 300 fish taxa,
except for small very remote islands where there are fewer fish.  The
Kalam on the other hand are great on land animals and distinguish
some 230 bird taxa, over 40 mammals (mainly marsupials), 35 frogs and
over 100 creepy crawly taxa. I would expect other New Guinea Highland
peoples to pattern pretty much like Kalam.

I'll post this note on the Austronesian Languages and Papuan
Languages lists to see if any of my colleagues there have opinions.

Andy Pawley
Linguistics Dept, RSPAS
Australian National University

>Malcolm (or whomever is taking this)
>For some time I have been trying to establish ball park figures for
>the size of the lexicon of unwritten languages, i.e. languages that
>will not be full of learned European loans etc. and I have been
>getting estimates from historical linguists that range beyond a
>single order of magnitude (3,000 to 50,000). If there is a reliable
>source out there that covers such could you let me know. Otherwise,
>could this be asked around. I do appreciate how difficult this is to
>estimate especially given the problem of defining lexemes but some
>form of general order of magnitude would be useful.
>Arcling mailing list
>Arcling at anu.edu.au