[Corpora-List] Corpus size for lexicography

Ramesh Krishnamurthy ramesh at easynet.co.uk
Mon Sep 30 23:32:07 UTC 2002


Dear Robert Amsler

I am concerned that your statements regarding 
corpus sizes for lexicographic purposes might be
*highly* misleading, at least for English:
> 1 million words we now know to be quite small 
> (adequate only for a Pocket Dictionary's worth of entries).
> Collegiate dictionaries require at least a 10 million word corpus, and
> Unabridged dictionaries at least 100 million words (the target of the ANC).

1. From my experience while working for Cobuild at Birmingham University:

a) approx. half of the types/wordforms in most corpora have only one token (i.e. occur only once):
e.g. 213,684 out of 475,633 in the 121m corpus (1993); 438,647 out of 938,914 in the 418m corpus (2000);
a rough counting sketch follows after point (d) below.

b) dictionary entries cannot be based on one example; so let us say you need at least 10 examples
(a very modest figure; in fact, as our corpus has grown, and our software and understanding have
become more sophisticated, the minimum threshold has increased for some linguistic phenomena, as
we find that we often require many more examples before particular features/patterns even
become apparent, or certain statistics become reliable)

c) many types with 10+ tokens will not be included in most dictionaries (e.g. numerical entities,
proper names, etc.; some may be included in a dictionary, e.g. 24-7, the White House, etc.,
depending on editorial policy; the placement problem for numerical entities is a separate issue)

d) there are roughly 2.2 types per lemma (roughly equal to a dictionary headword) in English
(the lemma "be" has c. 18 types, including some archaic ones and contractions; most verbs have 
4 or 5 types; at the other end of the scale, many uncount nouns and adjectives, most adverbs and 
grammatical words, have only one type); of course, some types that belong to a lemma will still
need to be treated as headwords in their own right, for sound lexicographic reasons.
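
(As promised under 1a, here is a minimal Python sketch for producing such counts on one's own
material. The input file name, the variable names and the simple whitespace tokenisation are my
own assumptions for illustration, not a description of Cobuild's actual processing.)

from collections import Counter

# Read a plain-text corpus and split on whitespace (an assumption for this sketch;
# the Bank of English was not tokenised this crudely).
with open("corpus.txt", encoding="utf-8") as f:
    corpus_tokens = f.read().split()

freq = Counter(corpus_tokens)                      # tokens per type (wordform)
n_types = len(freq)
hapaxes = sum(1 for c in freq.values() if c == 1)  # types occurring only once
types_10plus = sum(1 for c in freq.values() if c >= 10)

print(f"types: {n_types:,}")
print(f"types with only one token: {hapaxes:,} ({hapaxes / n_types:.0%})")
print(f"types with 10+ tokens: {types_10plus:,}")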

2. Calculating potential dictionary headwords from corpus facts and figures:

a) In the 18m Cobuild corpus (1986), there were 43,579 types with 10+ tokens.
Dividing by 2.2, we get c. 19,800 lemmas with 10+ tokens, i.e. potential dictionary headwords

b) In the 120m Cobuild Bank of English corpus (1993), there were
99,326 types with 10+ tokens = c. 45,150 headwords

c) In the 450m Bank of English corpus (2001), there were
204,626 types with 10+ tokens = c. 93,000 headwords

I don't think the Cobuild corpora are untypical for such rough calculations.
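
(For completeness, the same estimate as a few lines of Python; the figures are those quoted in
2a-c, and the 2.2 divisor is the rough types-per-lemma ratio from 1d.)

TYPES_PER_LEMMA = 2.2  # rough ratio of wordforms to lemmas for English (see 1d)

corpora = {
    "Cobuild 18m (1986)": 43_579,
    "Bank of English 120m (1993)": 99_326,
    "Bank of English 450m (2001)": 204_626,
}

for name, types_10plus in corpora.items():
    headwords = types_10plus / TYPES_PER_LEMMA
    # rounded to the nearest 50 to match the "c." figures quoted above
    print(f"{name}: {types_10plus:,} types with 10+ tokens "
          f"=> c. {50 * round(headwords / 50):,} potential headwords")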

3. Some dictionary figures:

It is difficult to gauge from dictionary publishers' marketing blurbs exactly how many headwords
are in their dictionaries, but here are a few figures taken from the Web today (unless otherwise stated).

a) Pocket:
Webster's New World Pocket = 37,000 entries

b) Collegiate:
New Shorter OED: 97,600 entries
Oxford Concise: 220,000 words, phrases and meanings
Webster's New World College: 160,000 entries
(cf Collins English Dictionary 1992: 180,000 references)

c) Unabridged:
OED: 500,000 entries
Random House Webster's Unabridged: 315,000 entries
(cf American Heritage 1992: 350,000 entries/meanings)

d) EFL Dictionaries
(cf Longman 1995: 80,000 words/phrases)
(cf Oxford 1995: 63,000 references)
(cf Cambridge 1995: 100,000 words/phrases)
(cf Cobuild 1995: 75,000 references)

4. So, by my reckoning, the 100m-word ANC corpus (yielding less than
45,000 potential headwords) will be adequate for a Pocket Dictionary, but 
will struggle to meet Collegiate requirements, and will be totally inadequate 
as the sole basis for an Unabridged Dictionary (if that really is the ANC's aim).

Surely we will need corpora in the billions of words range before we can start to compile 
truly corpus-based Unabridged dictionaries. Until then, corpora can assist us in most 
lexicographic and linguistic enterprises, but we cannot say that they are adequate in 
size. It is no coincidence that corpora were first used for EFL lexicography, where
the number of headwords required is more modest. But even here, it took much
larger corpora to give us reliable evidence of the range of meanings, grammatical 
patterning and collocational behaviour of all but the most common words.

I have no wish to disillusion lexicographers working with smaller corpora. Cobuild's initial 
attempts in corpus lexicography entailed working with evidence from corpora of 1m and 
7m words. Many of those analyses remain valid in essence, even when checked in our 
450m word corpus. But we now have a better overview, and many more accurate details. 
Smaller corpora can be adequate for more restricted investigations, such as domain-specific 
dictionaries, local grammars, etc. But for robust generalizations about the entire lexicon, the 
bigger the corpus the better.

Best
Ramesh

Ramesh Krishnamurthy
Honorary Research Fellow, University of Birmingham;
Honorary Research Fellow, University of Wolverhampton;
Consultant, Cobuild and Bank of English Corpus, Collins Dictionaries.

----- Original Message -----
From: "Amsler, Robert" <Robert.Amsler at hq.doe.gov>
To: corpora at hd.uib.no
Subject: RE: [Corpora-List] ACL proceedings paper in the American National Corpus

There is clearly an issue here regarding what the American National Corpus
is trying to represent. The Brown Corpus tried to be "representative" by
extracting equal-sized samples selected from all the publications of a given
year. As has been found, it failed to adequately determine that all the
texts were created by American authors; and, alas, 1 million words we now know
to be quite small (adequate only for a Pocket Dictionary's worth of entries).
Collegiate dictionaries require at least a 10 million word corpus, and
Unabridged dictionaries at least 100 million words (the target of the ANC).

However, what I detect so far from the ANC literature is that they are first trying to fill the quota of 100 million words and are only secondarily concerned with "balancing" the corpus for genre and sample sizes.

Also, if I'm not mistaken, the Brown corpus didn't JUST balance for genres;
it also tried to balance for timespan. I.e., it tried to form a closed universe
of possible publications and then representatively sample from that universe.
This involves attempting to determine all the possible publications in that
universe and then selecting a subset which represents them in both quantity
and genre. While it may seem ambitious to first decide what is in the list
of all available publications (especially if your criterion for inclusion
is merely "published after 1990"), it may be the only way to have a universe
from which a truly random sample can be extracted.
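
(By way of illustration only, and not a description of how Brown or the ANC actually proceeded:
a closed-universe, genre-balanced random sample could be sketched in Python as below. The
publication titles, genre labels and quotas are invented for the example.)

import random

# Hypothetical closed universe: every candidate publication, tagged with its genre.
universe = [
    ("Press report A", "press"), ("Press report B", "press"),
    ("Novel A", "fiction"), ("Novel B", "fiction"), ("Novel C", "fiction"),
    ("Manual A", "technical"), ("Manual B", "technical"),
]

# Quotas per genre decide the balance of the final sample.
quota = {"press": 1, "fiction": 2, "technical": 1}

random.seed(0)  # fixed seed so the example is reproducible
sample = []
for genre, k in quota.items():
    stratum = [title for title, g in universe if g == genre]
    sample.extend(random.sample(stratum, k))  # random draw within each stratum

print(sample)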

Note: Brown Corpus Manual http://www.hit.uib.no/icame/brown/bcm.html

Robert A. Amsler
robert.amsler at hq.doe.gov
(301) 903-8823


