Linguistic dark matter

Fri Dec 17 16:24:08 UTC 2010

        It seems extremely plausible to me that a high percentage of
words is undocumented in standard references.  Many words are obscure
and rarely used; lexicographers rightly place reasonably high standards
on what words they will include in their dictionaries.

        Here is the relevant portion of the article:

        <<How many words are in the English language (9)?
        We call a 1-gram [i.e., a string of characters uninterrupted by
a space, such as "banana" or "3.14159"] "common" if its frequency is
greater than one per billion. (This corresponds to the frequency of the
words listed in leading dictionaries (7).) We compiled a list of
all common 1-grams in 1900, 1950, and 2000 based on the
frequency of each 1-gram in the preceding decade. These lists
contained 1,117,997 common 1-grams in 1900, 1,102,920 in
1950, and 1,489,337 in 2000.
        Not all common 1-grams are English words. Many fell
into three non-word categories: (i) 1-grams with nonalphabetic
characters ("l8r", "3.14159"); (ii) misspellings
("becuase", "abberation"); and (iii) foreign words
("sensitivo").
        To estimate the number of English words, we manually
annotated random samples from the lists of common 1-grams
(7) and determined what fraction were members of the above
non-word categories. The result ranged from 51% of all
common 1-grams in 1900 to 31% in 2000.
        Using this technique, we estimated the number of words in
the English lexicon as 544,000 in 1900, 597,000 in 1950, and
1,022,000 in 2000. The lexicon is enjoying a period of
enormous growth: the addition of ~8500 words/year has
increased the size of the language by over 70% during the last
fifty years (Fig. 2A).
        Notably, we found more words than appear in any
dictionary. For instance, the 2002 Webster's Third New
International Dictionary [W3], which keeps track of the
contemporary American lexicon, lists approximately 348,000
single-word wordforms (10); the American Heritage
Dictionary of the English Language, Fourth Edition (AHD4)
lists 116,161 (11). (Both contain additional multi-word
entries.) Part of this gap is because dictionaries often exclude
proper nouns and compound words ("whalewatching"). Even
accounting for these factors, we found many undocumented
words, such as "aridification" (the process by which a
geographic region becomes dry), "slenthem" (a musical
instrument), and, appropriately, the word "deletable."
        This gap between dictionaries and the lexicon results from
a balance that every dictionary must strike: it must be
comprehensive enough to be a useful reference, but concise
enough to be printed, shipped, and used. As such, many
infrequent words are omitted. To gauge how well dictionaries
reflect the lexicon, we ordered our year 2000 lexicon by
frequency, divided it into eight deciles (ranging from 10^-9 -
10^-8 to 10^-2 - 10^-1), and sampled each decile (7). We manually
checked how many sample words were listed in the OED (12)
and in the Merriam-Webster Unabridged Dictionary [MWD].
(We excluded proper nouns, since neither OED nor MWD
lists them.) Both dictionaries had excellent coverage of high
frequency words, but less coverage for frequencies below 10^-6:
67% of words in the 10^-9 - 10^-8 range were listed in neither
dictionary (Fig. 2B). Consistent with Zipf's famous law, a
large fraction of the words in our lexicon (63%) were in this
lowest frequency bin. As a result, we estimated that 52% of
the English lexicon - the majority of the words used in
English books - consists of lexical "dark matter"
undocumented in standard references (12).>>
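
        The arithmetic in the excerpt can be checked with a short
Python sketch. This is only a back-of-the-envelope reading of the quoted
figures, not the authors' actual procedure: the variable and function
names are made up for illustration, the non-word percentages are the
rounded ones from the excerpt (none is given for 1950), and the
assumption that the lexicon estimate is simply "common 1-grams times
the fraction that are words" is mine.

```python
# Figures copied from the quoted excerpt; all names here are illustrative.
common_1grams = {1900: 1_117_997, 1950: 1_102_920, 2000: 1_489_337}
nonword_fraction = {1900: 0.51, 2000: 0.31}  # rounded; 1950 not given
reported_lexicon = {1900: 544_000, 1950: 597_000, 2000: 1_022_000}


def estimate_lexicon(total_1grams, nonword):
    """Assumed model: words = common 1-grams minus the non-word share."""
    return total_1grams * (1.0 - nonword)


for year, frac in nonword_fraction.items():
    est = estimate_lexicon(common_1grams[year], frac)
    print(f"{year}: estimated {est:,.0f} vs reported {reported_lexicon[year]:,}")

# Growth between 1950 and 2000, using the reported lexicon sizes.
growth_per_year = (reported_lexicon[2000] - reported_lexicon[1950]) / 50
growth_ratio = reported_lexicon[2000] / reported_lexicon[1950] - 1
print(f"growth: {growth_per_year:,.0f} words/year, {growth_ratio:.0%} over 50 years")

# Share of the whole lexicon that is undocumented AND in the lowest bin.
lowest_bin_share = 0.63         # fraction of lexicon in the 10^-9 - 10^-8 bin
lowest_bin_undocumented = 0.67  # fraction of that bin in neither dictionary
dark_from_lowest_bin = lowest_bin_share * lowest_bin_undocumented
print(f"dark matter from lowest bin alone: {dark_from_lowest_bin:.1%}")
```

        With the rounded percentages the estimates come out a few
thousand above the reported 544,000 and 1,022,000, which is about what
one would expect from rounding. The lowest frequency bin alone accounts
for roughly 42 of the 52 percentage points; the rest would presumably
have to come from the higher-frequency bins, whose coverage figures are
not in the excerpt.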

        In the online supporting materials, the authors mention that the
dictionaries' performance was boosted appreciably by the inclusion of
Merriam-Webster's Medical Dictionary (they used the online versions of
MW and the OED).  Presumably other specialized dictionaries would have
further aided performance.

John Baker

-----Original Message-----
From: American Dialect Society [mailto:ADS-L at LISTSERV.UGA.EDU] On Behalf Of Ben Zimmer
Sent: Friday, December 17, 2010 10:15 AM
To: ADS-L at LISTSERV.UGA.EDU
Subject: Re: Linguistic dark matter

On Fri, Dec 17, 2010 at 9:14 AM, Michael Quinion
<wordseditor at worldwidewords.org> wrote:
>
> David Barnhart wrote
>
> > If you haven't noticed I'm skeptical of the "tool".
>
> I'm certainly sceptical of that 52% "undocumented in standard
> references", which was why I quoted that sentence. The figure seems
> extremely high. As I can't get access to the Science article (which is
> only free online to subscribers), I can't begin to work out its basis.
>
> The researchers seem not to have applied many lexical filters. Proper
> names are included, because they want the corpus to be a cultural tool
> as well as a lexicographical one. Similarly, they allow scientific names
> ("Turdus merula" and the like). I would have thought that - if the
> "standard references" are restricted to general dictionaries - proper
> and scientific names would account for a big part of that missing 52%.

To be fair, proper nouns were included in the researchers' overall
lexical count, but the "dark matter" is not 52% of that number. They
did filter out proper nouns for that part of the analysis, since they
were going for an apples-to-apples comparison with the OED and
Webster's Third. The media coverage doesn't get into these subtleties,
of course.

--bgz

--
Ben Zimmer
http://benzimmer.com/

------------------------------------------------------------
The American Dialect Society - http://www.americandialect.org