Fw: [Lexicog] Word lists for big dictionaries

Mon Mar 15 11:06:13 UTC 2004

Mery --

You said: 

"1) To create a big general bilingual dictionary should we start from the number of lemmas which are most frequent in the wordlist extracted from our corpus and then increase or decrease their quantity according to the space available in the dictionary, or should we start from the number of entries generally needed in a dictionary of that size and then extract from the corpus the needed number of most frequent corpus lemmas?"

"Most frequent .. extracted from the corpus" - there's a misconception here, I suspect...

The question is, is the corpus word list bigger than the dictionary, or vice versa?

My previous message commented on small dictionaries.  For a big dictionary -- even a one-volume big dictionary, such as the (New) Oxford Dictionary of English -- a corpus alone is not enough. After throwing out names and junk (e.g. strings of letters that are not words at all) we included ALL the words and senses in the corpus, AND THEN SOME MORE.  

What more?  Well, here are some examples:

1. Some of the words in NODE that are not in the BNC are words of historical importance - used by Shakespeare, for example, but since obsolete. 

2. Other entries are names of plants and animals, for which a systematic survey of the literature was carried out by my colleagues David Shirt and Bill Trumble, constantly making difficult decisions about whether the local name for some non-European plant/animal should be included in the dictionary, to reflect the "global" nature of modern English. 

3. Sometimes rather rare scientific terms are included for consistency of coverage.  For example, no one would disagree with the decision to include "carbon" and "hydrogen", but for consistency of sets, we included ALL the chemical elements, even the rarest ones. A similar principle was applied to many other fields.  I myself wrote the entries for languages and peoples, using the Oxford Encyclopedia of Language and Linguistics as a guide, but making only a selection, and trying (no doubt failing!) to be consistent. I included many names of languages and peoples that do not occur in any BNC text. Conversely, it is theoretically possible (at this date, I can't remember an actual case) that the BNC may have several mentions of some extremely rare language which happened to be in the news in 1991-3 but which did not make it into NODE. A language or people that hit the headlines briefly in 1991-3 -- for example in a story or review of language death -- would not necessarily merit inclusion in a 1998 dictionary. 

4. Other entries are new words or new senses discovered by the Oxford Reading Program.  An example that comes to mind is the use of "dope" in Black American English as an adjective of approbation -- "Man, that suit is dope".  Not surprisingly, this sense is not in the British National Corpus.  Another example from the same register: "hood" and "burbs". I'm sad to say that in NODE we pulled our punches when defining these two, failing to explain the connotations adequately. Somewhere I have a wonderful (recent) citation about a rowdy rapper who was fired by his recording company for "bringing the hood to the burbs".  But I digress. 

Thus, to create the word list for NODE, we had at least three techniques: corpus (which helped us to shape the entries for all the common words), literature survey of special fields, and citations for new words and senses -- often informal in register -- from the reading program.  The Oxford view is that there is no substitute for a reading program. Well, we were 

Something similar happened on the first edition of Collins English Dictionary (1979). In those days, we did not have a corpus at all. CED supplemented the basic word list with a systematic survey of the literature (mostly course books) in many different specialist fields. 
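
For anyone who wants to try this on their own data, the combination of sources described above can be sketched in a few lines of Python. This is only an illustration under my own assumptions, not the procedure actually used for NODE or CED: the function name build_headword_list, the source labels, and the toy word lists are all hypothetical, and the sample data are invented rather than taken from the BNC.

```python
def build_headword_list(corpus_lemmas, junk, survey_terms, reading_terms):
    """Merge three sources into one headword list.

    Returns a dict mapping each headword to the set of sources that
    supplied it. All names and labels here are hypothetical.
    """
    sources = {}
    # Corpus: keep every lemma except names and junk strings.
    for lemma in corpus_lemmas:
        if lemma in junk:
            continue
        sources.setdefault(lemma, set()).add("corpus")
    # Literature survey of special fields (e.g. complete sets of terms).
    for term in survey_terms:
        sources.setdefault(term, set()).add("survey")
    # Citations for new words and senses from a reading program.
    for term in reading_terms:
        sources.setdefault(term, set()).add("reading")
    return sources


# Toy data, invented for illustration only.
corpus = ["carbon", "hydrogen", "hood", "xqzt"]
junk = {"xqzt"}  # a string of letters that is not a word
survey = ["carbon", "hydrogen", "einsteinium"]
reading = ["dope", "burbs", "hood"]

headwords = build_headword_list(corpus, junk, survey, reading)

# Headwords that the corpus alone would not have supplied.
not_in_corpus = sorted(w for w, s in headwords.items() if "corpus" not in s)
print(not_in_corpus)
```

Recording the source of each headword, rather than just taking a plain union, makes it easy to see afterwards how much of the list the corpus alone would have missed. Note that this works at the level of words only; new senses of existing words would need a finer-grained key.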

* * *

I don't know about bilingual lexicography, but I suspect that, for dictionary-rich languages, a reasonable starting point would be a comparison of the word lists in native-speaker dictionaries in the two languages.  Comment, anyone?

Patrick (again). 
