Linguistic dark matter

Jonathan Lighter wuxxmupp2000 at GMAIL.COM
Fri Dec 17 16:38:15 UTC 2010

And how many of those "dark" words are unnaturalized foreign words in
English contexts?

Of course, I imagine their definition of "word" is airtight. Frequent
two-word compounds, for example, many of which require no dictionary entry,
can be multiplied almost ad lib. Probably the original article
addresses such issues, but at this point I'd be less than honest if I said I


>        It seems extremely plausible to me that a high percentage of
> words is undocumented in standard references.  Many words are obscure
> and rarely used; lexicographers rightly place reasonably high standards
> on what words they will include in their dictionaries.
>        Here is the relevant portion of the article:
>        <<How many words are in the English language (9)?
>        We call a 1-gram [i.e., a string of characters uninterrupted by
> a space, such as "banana" or "3.14159"] "common" if its frequency is
> greater than one per billion. (This corresponds to the frequency of the
> words listed in leading dictionaries (7).) We compiled a list of
> all common 1-grams in 1900, 1950, and 2000 based on the
> frequency of each 1-gram in the preceding decade. These lists
> contained 1,117,997 common 1-grams in 1900, 1,102,920 in
> 1950, and 1,489,337 in 2000.
>        Not all common 1-grams are English words. Many fell
> into three non-word categories: (i) 1-grams with nonalphabetic
> characters ("l8r", "3.14159"); (ii) misspellings
> ("becuase", "abberation"); and (iii) foreign words
> ("sensitivo").
>        To estimate the number of English words, we manually
> annotated random samples from the lists of common 1-grams
> (7) and determined what fraction were members of the above
> non-word categories. The result ranged from 51% of all
> common 1-grams in 1900 to 31% in 2000.
>        Using this technique, we estimated the number of words in
> the English lexicon as 544,000 in 1900, 597,000 in 1950, and
> 1,022,000 in 2000. The lexicon is enjoying a period of
> enormous growth: the addition of ~8500 words/year has
> increased the size of the language by over 70% during the last
> fifty years (Fig. 2A).
>        Notably, we found more words than appear in any
> dictionary. For instance, the 2002 Webster's Third New
> International Dictionary [W3], which keeps track of the
> contemporary American lexicon, lists approximately 348,000
> single-word wordforms (10); the American Heritage
> Dictionary of the English Language, Fourth Edition (AHD4)
> lists 116,161 (11). (Both contain additional multi-word
> entries.) Part of this gap is because dictionaries often exclude
> proper nouns and compound words ("whalewatching"). Even
> accounting for these factors, we found many undocumented
> words, such as "aridification" (the process by which a
> geographic region becomes dry), "slenthem" (a musical
> instrument), and, appropriately, the word "deletable."
>        This gap between dictionaries and the lexicon results from
> a balance that every dictionary must strike: it must be
> comprehensive enough to be a useful reference, but concise
> enough to be printed, shipped, and used. As such, many
> infrequent words are omitted. To gauge how well dictionaries
> reflect the lexicon, we ordered our year 2000 lexicon by
> frequency, divided it into eight deciles (ranging from 10^-9 -
> 10^-8 to 10^-2 - 10^-1), and sampled each decile (7). We manually
> checked how many sample words were listed in the OED (12)
> and in the Merriam-Webster Unabridged Dictionary [MWD].
> (We excluded proper nouns, since neither OED nor MWD
> lists them.) Both dictionaries had excellent coverage of high
> frequency words, but less coverage for frequencies below 10^-6:
> 67% of words in the 10^-9 - 10^-8 range were listed in neither
> dictionary (Fig. 2B). Consistent with Zipf's famous law, a
> large fraction of the words in our lexicon (63%) were in this
> lowest frequency bin. As a result, we estimated that 52% of
> the English lexicon - the majority of the words used in
> English books - consists of lexical "dark matter"
> undocumented in standard references (12).>>
>        In the online supporting materials, the authors mention that the
> dictionaries' performance was boosted appreciably by the inclusion of
> Merriam-Webster's Medical Dictionary (they used the online versions of
> MW and the OED).  Presumably other specialized dictionaries would have
> further aided performance.
> John Baker
> >
> > David Barnhart wrote
> >
> > > If you haven't noticed I'm skeptical of the "tool".
> >
> > I'm certainly sceptical of that 52% "undocumented in standard
> > references", which was why I quoted that sentence. The figure seems
> > extremely high. As I can't get access to the Science article (which is
> > only free online to subscribers), I can't begin to work out its basis.
> >
> > The researchers seem not to have applied many lexical filters. Proper
> > names are included, because they want the corpus to be a cultural tool
> > as well as a lexicographical one. Similarly, they allow scientific names
> > ("Turdus merula" and the like). I would have thought that - if the
> > "standard references" are restricted to general dictionaries - proper
> > and scientific names would account for a big part of that missing 52%.
> To be fair, proper nouns were included in the researchers' overall
> lexical count, but the "dark matter" is not 52% of that number. They
> did filter out proper nouns for that part of the analysis, since they
> were going for an apples-to-apples comparison with the OED and
> Webster's Third. The media coverage doesn't get into these subtleties,
> of course.
> --bgz
"If the truth is half as bad as I think it is, you can't handle the truth."

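As a rough check on the figures in the quoted excerpt, the arithmetic can be
sketched in Python. This is only a sketch, not the authors' procedure: it
assumes the word estimate is simply the count of common 1-grams times the
share that are not non-words, and that the 63% and 67% figures refer to the
same set of words. All variable and function names are made up for
illustration. The excerpt gives no non-word percentage for 1950, so that year
is left out of the first check.

```python
# Figures copied from the quoted Science excerpt.
common_1grams = {1900: 1_117_997, 2000: 1_489_337}
nonword_fraction = {1900: 0.51, 2000: 0.31}  # rounded percentages as quoted
reported_words = {1900: 544_000, 1950: 597_000, 2000: 1_022_000}


def estimate_words(year):
    """Assumed method: common 1-grams times the fraction that are real words."""
    return common_1grams[year] * (1 - nonword_fraction[year])


def relative_gap(year):
    """How far the assumed estimate is from the reported figure, as a fraction."""
    return abs(estimate_words(year) - reported_words[year]) / reported_words[year]


# Growth between 1950 and 2000, using the reported estimates.
words_added_per_year = (reported_words[2000] - reported_words[1950]) / 50
growth_since_1950 = reported_words[2000] / reported_words[1950] - 1

# Share of the whole lexicon that is both in the lowest frequency bin (63%)
# and listed in neither dictionary (67% of that bin). Assumption: both
# percentages are taken over the same set of words.
dark_share_from_lowest_bin = 0.63 * 0.67

if __name__ == "__main__":
    for year in (1900, 2000):
        print(year, round(estimate_words(year)), reported_words[year])
    print("words added per year:", words_added_per_year)
    print("growth since 1950:", round(growth_since_1950, 3))
    print("dark share from lowest bin:", round(dark_share_from_lowest_bin, 3))
```

Under these assumptions the rounded percentages land within about 1% of the
reported 544,000 and 1,022,000, and the lowest frequency bin alone would
supply roughly 42 of the 52 percentage points of "dark matter".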
