<!doctype html public "-//W3C//DTD W3 HTML//EN">

<html><head><style type="text/css"><!--

blockquote, dl, ul, ol, li { margin-top: 0 ; margin-bottom: 0 }

 --></style><title>On the size of the lexicon in preliterate

languages</title></head><body>

<div>Dear Jim (if I may)</div>

<div><br></div>

<div>I understand your concern to be getting an idea of the size of

the 'indigenous' lexicon in languages of preliterate societies.</div>

<div><br></div>

<div>I can tell you something about estimates based on the better

dictionaries for 'preliterate' languages of the Austronesian family

and the Trans New Guinea (the largest Papuan) family.</div>

<div><br></div>

<div>But first some methodological considerations. We can't make

useful comparisons without agreeing on the basic units to be counted.

Defining terms such as 'lexical unit' and 'lexeme' is, as you

indicate, crucial to estimating the size of the lexicon.</div>

<div><br></div>

<div>Like D.A. Cruse in his book <i>Lexical Semantics</i>, I regard

the basic lexical unit as the pairing of a form with a single sense. 

Just counting 'lexical entries' or 'headwords' is highly

unsatisfactory -- different dictionaries may organise entries on

radically different principles so that counts of entries or headwords

will not be commensurate. A polysemous root like <i>run</i>,

<i>take</i> or <i>head</i> consists of many sense units and each

such unit has to be learnt separately.  A family of sense units

forms a lexeme. One can in turn recognise a family of lexemes (related

by derivation, compounding, etc.) which some dictionaries will include

in a single entry and others will not. </div>

<div><br></div>

<div>Given that the 10 most polysemous verb roots in English total 552

senses between them in the Macquarie Dictionary (many more in the OED,

but that includes obsolete senses), and the top 200 verb roots total

over 3000 senses, you can see that a count of sense units will yield a

much larger lexicon than a count of lexemes. Comparison is

further complicated by the fact that different languages seem to have

different amounts of polysemy. (It is true that there is some

fuzziness in boundaries between sense units but there are tests for

polysemy that work most of the time.)</div>

<div><br></div>

<div>There are other considerations. Just counting single-word lexical

units will result in an estimate that is far too low. In most,

probably all, languages much of the lexicon consists of compounds and

phrasal units.  Estimating the size of the multi-word lexicon as

opposed to the single-word lexicon can't be done by a simple general

formula because languages vary considerably in how much use they

make of compounding and phrasal units.</div>

<div><br></div>

<div>Defining the boundary between inflection and derivation and

whether to count inflected forms is another issue. I think most of us

agree that we should not count regular inflected forms but we should

count irregular ones.  Another variable is the treatment of

dialect variants. Some dictionaries represent a single regional

dialect, others include material from a number of dialects.  And

so on.   </div>

<div><br></div>

<div>Anyway, my own experience of attempting to compile comprehensive

dictionaries is limited to one Austronesian language (Wayan Fijian)

and one Trans New Guinea language (Kalam). I've been toiling at both

for over 30 years, off and on. </div>

<div><br></div>

<div>Wayan is a dialect of the Western Fijian language spoken by a

farming and fishing community of about 1500 people.  The

Wayan-English dictionary (1000 pages) contains around 35,000 sense

units, of which probably not more than 3 percent would be loanwords

from non-Fijian languages. I haven't done a sampling of lexemes but at

a guess there are around 20,000 to 25,000. For sure, I have missed many

thousands of multiword units and probably some thousands of derived

words, as well as many foreign words and phrases that are more or less

integrated into Wayans' speech repertoires. </div>

<div><br></div>

<div>Kalam is spoken by a farming people on the fringes of the New

Guinea Highlands. At first European contact (in the 1950s and 60s)

there were about 13,000 Kalam, though they were divided among several

regional dialects.  The Kalam-English dictionary is smaller than

the Wayan one, containing about 15,000 sense units. Why is it smaller?

Mainly I think because Kalam doesn't have such a rich verbal

derivational system as Wayan and because, unlike Wayan, it cannot

derive verb roots from nouns and vice versa.</div>

<div><br></div>

<div>In her 1998 PhD thesis on problems in Tongan lexicography

Melenaite Taumoefolau made counts of the number of entries in the

largest dictionaries of Polynesian languages (Maori, Hawaiian, Tongan,

Samoan). As I recall it, these ranged from 19,000 to 23,000. These

figures don't tell us the number of basic lexical units (in my sense)

but they indicate that each of these four dictionaries probably contains

on the order of 30,000 to 50,000 lexical units.</div>

<div><br></div>

<div>All of which suggests that your historical linguist friends who

said 50,000 were talking more sense (no pun intended) than those

who said 3000.</div>

<div><br></div>

<div>Of some interest are the inventories for specialised semantic

domains. Kalam has over 1200 terms for plant taxa, Wayan has 600-700.

The Kalam have a richer flora (Waya is a small island) and make wider

use of it than contemporary Wayans, who are more westernised. 

Comparative ethnobotanical data indicate that preliterate language

communities generally have over 1000 terms for plants, provided they

live in a place with a rich flora.  The Wayans exploit a rich

marine environment and distinguish over 400 fish taxa, 140 mollusc

taxa and about 40 crustacean taxa. Other studies show that Pacific

Island fishing communities consistently distinguish well over 300 fish

taxa, except for small, very remote islands where there are fewer

fish. The Kalam, on the other hand, are great on land animals and

distinguish some 230 bird taxa, over 40 mammals (mainly marsupials),

35 frogs and over 100 creepy crawly taxa. I would expect other New

Guinea Highland peoples to pattern pretty much like Kalam.</div>

<div><br></div>

<div>I'll post this note on the Austronesian Languages and Papuan

Languages lists to see if any of my colleagues there have

opinions.</div>

<div><br></div>

<div>Andy Pawley</div>

<div>Linguistics Dept, RSPAS</div>

<div>Australian National University  </div>

<div><br></div>


<blockquote type="cite" cite>Malcolm (or whomever is taking this)<br>

<br>

For some time I have been trying to establish ball park figures for

the size of the lexicon of unwritten languages, i.e. languages that

will not be full of learned European loans etc. and I have been

getting estimates from historical linguists that range beyond a single

order of magnitude (3,000 to 50,000). If there is a reliable source

out there that covers such could you let me know. Otherwise, could

this be asked around. I do appreciate how difficult this is to

estimate especially given the problem of defining lexemes but some

form of general order of magnitude would be useful.<br>

<br>

Jim<br>

<br>

_______________________________________________<br>

Arcling mailing list<br>

Arcling@anu.edu.au<br>

http://mailman.anu.edu.au/mailman/listinfo/arcling</blockquote>

<div><br></div>

</body>

</html>