Linguistic dark matter

Fri Dec 17 19:56:20 UTC 2010

        Perhaps I should have included the following passage, which
explains how they calculated usage frequency.  Also, it should be
understood that the number after a 10 (that is, the -9, -8, and so
forth) is in superscript in the original, but my ASCII post converted
this to normal text.  In the following passage, therefore, "10-5" is
really 10 to the power of -5.  The authors apparently are using "decile"
to mean "order of magnitude."  Since the powers of 10 are negative
powers, a usage frequency of 10-8 would be for a low-frequency word and
10-2 would be for a high-frequency word.
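
        To make that reading concrete, here is a rough sketch of my own (nothing like it appears in the article) that treats the eight "deciles" as order-of-magnitude bins.  The name magnitude_bin is made up for the illustration, and I am assuming each bin runs from 10 to the power of n up to, but not including, 10 to the power of n+1.

```python
import math

def magnitude_bin(frequency):
    """Return (low_exp, high_exp) with 10**low_exp <= frequency < 10**high_exp.

    Hypothetical helper: the article does not say how bin edges are handled.
    """
    low_exp = math.floor(math.log10(frequency))
    return low_exp, low_exp + 1

# The eight ranges named in the article, lowest frequency first:
# 10^-9 to 10^-8, 10^-8 to 10^-7, ..., 10^-2 to 10^-1.
bins = [(exp, exp + 1) for exp in range(-9, -1)]

for low, high in bins:
    print(f"10^{low} to 10^{high}")
```

On this reading the first bin listed holds the rarest words and the last holds the most common ones.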

        <<Usage frequency is computed by dividing the number of
instances of the n-gram in a given year by the total number of words in
the corpus in that year.  For instance, in 1861, the 1-gram "slavery"
appeared in the corpus 21,460 times, on 11,687 pages of 1,208 books.
The corpus contains 386,434,758 words from 1861; thus the frequency is

John Baker
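
        The quoted passage breaks off before the result.  A minimal sketch of the division it describes, using only the 1861 counts quoted above, might look like this.  The variable names are mine, and the output is simply what the arithmetic gives, not a figure copied from the article.

```python
import math

occurrences = 21_460         # instances of the 1-gram "slavery" in 1861
corpus_words = 386_434_758   # total words in the corpus from 1861

# Usage frequency = instances of the n-gram / total words in that year.
frequency = occurrences / corpus_words

# Order of magnitude of the frequency, i.e. which power-of-ten range it is in.
exponent = math.floor(math.log10(frequency))

print(f"frequency = {frequency:.2e}")
print(f"range: 10^{exponent} to 10^{exponent + 1}")
```

That works out to a little over 5 occurrences per 100,000 words, which puts "slavery" in the 10-5 - 10-4 range for 1861.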

1)  The researchers "estimate the number of English words", but then
compare their corpus only with dictionaries of American, having
348,000 and 116,161 "single-word wordforms".  How many does the OED

2)   I once was a mathematician, but I find myself mightily confused
by the "eight deciles (ranging from 10-9 - 10-8 to 10-2 -
10-1)".  I've never seen such a notation for deciles, which are
usually referred to as "the 90th percentile" or similarly.  What does
the notation in the "ranging ..." explanation mean?  The "eight" I
might understand as omitting two from their study, but which two, and
      And which end is up (high frequency) vs. down (low
frequency)?  They write "Both dictionaries had excellent coverage of
high frequency words, but less coverage for frequencies below 10-6:
67% of words in the 10-9 - 10-8 range were listed in neither
dictionary (Fig. 2B)."  "Frequencies below 10-6" suggests that the
smaller numbers are the lower frequencies, but to refer to the "10-9
- 10-8 range" as the "but less coverage" confuses me.


>         It seems extremely plausible to me that a high percentage of
>words is undocumented in standard references.  Many words are obscure
>and rarely used; lexicographers rightly place reasonably high standards
>on what words they will include in their dictionaries.
>         Here is the relevant portion of the article:
>         <<How many words are in the English language (9)?
>         We call a 1-gram [i.e., a string of characters uninterrupted by
>a space, such as "banana" or "3.14159"] "common" if its frequency is
>greater than one per billion. (This corresponds to the frequency of the
>words listed in leading dictionaries (7).) We compiled a list of
>all common 1-grams in 1900, 1950, and 2000 based on the
>frequency of each 1-gram in the preceding decade. These lists
>contained 1,117,997 common 1-grams in 1900, 1,102,920 in
>1950, and 1,489,337 in 2000.
>         Not all common 1-grams are English words. Many fell
>into three non-word categories: (i) 1-grams with nonalphabetic
>characters ("l8r", "3.14159"); (ii) misspellings
>("becuase", "abberation"); and (iii) foreign words
>         To estimate the number of English words, we manually
>annotated random samples from the lists of common 1-grams
>(7) and determined what fraction were members of the above
>non-word categories. The result ranged from 51% of all
>common 1-grams in 1900 to 31% in 2000.
>         Using this technique, we estimated the number of words in
>the English lexicon as 544,000 in 1900, 597,000 in 1950, and
>1,022,000 in 2000. The lexicon is enjoying a period of
>enormous growth: the addition of ~8500 words/year has
>increased the size of the language by over 70% during the last
>fifty years (Fig. 2A).
>         Notably, we found more words than appear in any
>dictionary. For instance, the 2002 Webster's Third New
>International Dictionary [W3], which keeps track of the
>contemporary American lexicon, lists approximately 348,000
>single-word wordforms (10); the American Heritage
>Dictionary of the English Language, Fourth Edition (AHD4)
>lists 116,161 (11). (Both contain additional multi-word
>entries.) Part of this gap is because dictionaries often exclude
>proper nouns and compound words ("whalewatching"). Even
>accounting for these factors, we found many undocumented
>words, such as "aridification" (the process by which a
>geographic region becomes dry), "slenthem" (a musical
>instrument), and, appropriately, the word "deletable."
>         This gap between dictionaries and the lexicon results from
>a balance that every dictionary must strike: it must be
>comprehensive enough to be a useful reference, but concise
>enough to be printed, shipped, and used. As such, many
>infrequent words are omitted. To gauge how well dictionaries
>reflect the lexicon, we ordered our year 2000 lexicon by
>frequency, divided it into eight deciles (ranging from 10-9 -
>10-8 to 10-2 - 10-1), and sampled each decile (7). We manually
>checked how many sample words were listed in the OED (12)
>and in the Merriam-Webster Unabridged Dictionary [MWD].
>(We excluded proper nouns, since neither OED nor MWD
>lists them.) Both dictionaries had excellent coverage of high
>frequency words, but less coverage for frequencies below 10-6:
>67% of words in the 10-9 - 10-8 range were listed in neither
>dictionary (Fig. 2B). Consistent with Zipf's famous law, a
>large fraction of the words in our lexicon (63%) were in this
>lowest frequency bin. As a result, we estimated that 52% of
>the English lexicon - the majority of the words used in
>English books - consists of lexical "dark matter"
>undocumented in standard references (12).>>
>         In the online supporting materials, the authors mention that
>dictionaries' performance was boosted appreciably by the inclusion of
>Merriam-Webster's Medical Dictionary (they used the online versions of
>MW and the OED).  Presumably other specialized dictionaries would have
>further aided performance.
> >
> > David Barnhart wrote
> >
> > > If you haven't noticed I'm skeptical of the "tool".
> >
> > I'm certainly sceptical of that 52% "undocumented in standard
> > which was why I quoted that sentence. The figure seems extremely
> > I can't get access to the Science article (which is only free online
> > subscribers), I can't begin to work out its basis.
> >
> > The researchers seem not to have applied many lexical filters.
> > names are included, because they want the corpus to be a cultural
> > well as a lexicographical one. Similarly, they allow scientific
> > ("Turdus merula" and the like). I would have thought that - if the
> > "standard references" are restricted to general dictionaries -
> > scientific names would account for a big part of that missing 52%.
>To be fair, proper nouns were included in the researchers' overall
>lexical count, but the "dark matter" is not 52% of that number. They
>did filter out proper nouns for that part of the analysis, since they
>were going for an apples-to-apples comparison with the OED and
>Webster's Third. The media coverage doesn't get into these subtleties,
>of course.
Ben Zimmer
