Linguistic dark matter
Joel S. Berson
Berson at ATT.NET
Sat Dec 18 00:46:56 UTC 2010
Thanks, John; that makes it understandable. Their usage is not too
terrible, except for using "decile" to mean "order of magnitude." And
now I understand how there can be eight "deciles"!
I'm startled to see the number 10**9 (10 to the 9th power; I note
that the double asterisk is a standard form where superscripts are
not available--but the Google-ites apparently don't know standards)--an
almost astronomical range of values. But I suppose with a "corpus
contain[ing] 386,434,758 [0.386 x 10**9] words from 1861" alone, a
frequency range spanning a factor of 10**9 can be interesting.
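As a back-of-the-envelope check in Python (a sketch of my own; the
only figure taken from the article is the 1861 corpus size):

    # What frequencies are even observable in the 1861 slice?
    CORPUS_WORDS_1861 = 386_434_758  # total words from 1861, per the article

    # The rarest observable 1-gram appears exactly once:
    floor = 1 / CORPUS_WORDS_1861
    print(f"one occurrence -> frequency {floor:.1e}")  # ~2.6e-09

    # A word at the top of the reported bins (10**-1) would be one word
    # in ten -- tens of millions of occurrences in that year alone:
    print(f"frequency 10**-1 -> {0.1 * CORPUS_WORDS_1861:,.0f} occurrences")

    # So the observable range runs from about 10**-9 up toward 10**-1:
    # eight orders of magnitude, matching the paper's eight bins.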
Joel
At 12/17/2010 02:56 PM, Baker, John wrote:
>         Perhaps I should have included the following passage, which
>explains how they calculated usage frequency. Note that the exponents
>after each 10 (the -9, -8, and so forth) are superscripts in the
>original; since ASCII flattens them, they are written here with a
>double asterisk, so that "10**-5" means 10 to the power of -5. The
>authors apparently are using "decile" to mean "order of magnitude."
>Since the powers of 10 are negative, a usage frequency of 10**-8
>belongs to a low-frequency word and 10**-2 to a high-frequency word.
>
>
>         <<Usage frequency is computed by dividing the number of
>instances of the n-gram in a given year by the total number of words in
>the corpus in that year. For instance, in 1861, the 1-gram "slavery"
>appeared in the corpus 21,460 times, on 11,687 pages of 1,208 books.
>The corpus contains 386,434,758 words from 1861; thus the frequency is
>5.5 x 10**-5.>>
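The quoted calculation is easy to reproduce; a minimal sketch in
Python, using only the numbers from the passage above:

    # Usage frequency per the quoted definition: occurrences of the
    # n-gram in a year divided by total corpus words for that year.
    def usage_frequency(ngram_count: int, corpus_words: int) -> float:
        return ngram_count / corpus_words

    # The "slavery" example from 1861:
    freq = usage_frequency(ngram_count=21_460, corpus_words=386_434_758)
    print(f"{freq:.2e}")  # 5.55e-05, i.e. the 5.5 x 10**-5 reported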
>
>
>John Baker
>
>
>
>-----Original Message-----
>From: American Dialect Society [mailto:ADS-L at LISTSERV.UGA.EDU] On Behalf
>Of Joel S. Berson
>Sent: Friday, December 17, 2010 2:07 PM
>To: ADS-L at LISTSERV.UGA.EDU
>Subject: Re: Linguistic dark matter
>
>1) The researchers "estimate the number of English words", but then
>compare their corpus only with dictionaries of American English, which
>have 348,000 and 116,161 "single-word wordforms" respectively. How
>many does the OED have?
>
>2) I once was a mathematician, but I find myself mightily confused
>by the "eight deciles (ranging from 10**-9 to 10**-8 up through 10**-2
>to 10**-1)". I've never seen such a notation for deciles, which are
>usually referred to as "the 90th percentile" or the like. What does
>the notation in the "ranging ..." explanation mean? The "eight" I
>might understand as omitting two bins from their study, but which two,
>and why?
>    And which end is up (high frequency) vs. down (low
>frequency)? They write "Both dictionaries had excellent coverage of
>high frequency words, but less coverage for frequencies below 10**-6:
>67% of words in the 10**-9 to 10**-8 range were listed in neither
>dictionary (Fig. 2B)." "Frequencies below 10**-6" suggests that the
>smaller numbers are the lower frequencies, but referring to the 10**-9
>to 10**-8 range as the one with "less coverage" confuses me.
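One way to read that notation (my own sketch, not the authors' code):
the eight "deciles" are decade-wide frequency bins running from the
one-per-billion threshold up to 10**-1, with the lowest bin holding
the rarest words--which is where dictionary coverage thins out:

    # Enumerate the eight decade-wide frequency bins, rarest first.
    bins = [(10.0**k, 10.0**(k + 1)) for k in range(-9, -1)]
    for lo, hi in bins:
        print(f"[{lo:.0e}, {hi:.0e})")
    # Prints eight bins, from [1e-09, 1e-08)  <- rarest; worst coverage
    # up through             [1e-02, 1e-01)  <- commonest; best coverage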
>
>Joel
>
>At 12/17/2010 11:24 AM, Baker, John wrote:
> > It seems extremely plausible to me that a high percentage of
> >words is undocumented in standard references. Many words are obscure
> >and rarely used; lexicographers rightly place reasonably high standards
> >on what words they will include in their dictionaries.
> >
> > Here is the relevant portion of the article:
> >
> >
> > <<How many words are in the English language (9)?
> >         We call a 1-gram (i.e., a string of characters uninterrupted
> >by a space, such as "banana" or "3.14159") "common" if its frequency
> >is greater than one per billion. (This corresponds to the frequency
> >of the words listed in leading dictionaries (7).) We compiled a list
> >of all common 1-grams in 1900, 1950, and 2000 based on the
> >frequency of each 1-gram in the preceding decade. These lists
> >contained 1,117,997 common 1-grams in 1900, 1,102,920 in
> >1950, and 1,489,337 in 2000.
> > Not all common 1-grams are English words. Many fell
> >into three non-word categories: (i) 1-grams with nonalphabetic
> >characters ("l8r", "3.14159"); (ii) misspellings
> >("becuase, "abberation"); and (iii) foreign words
> >("sensitivo").
> > To estimate the number of English words, we manually
> >annotated random samples from the lists of common 1-grams
> >(7) and determined what fraction were members of the above
> >non-word categories. The result ranged from 51% of all
> >common 1-grams in 1900 to 31% in 2000.
> > Using this technique, we estimated the number of words in
> >the English lexicon as 544,000 in 1900, 597,000 in 1950, and
> >1,022,000 in 2000. The lexicon is enjoying a period of
> >enormous growth: the addition of ~8500 words/year has
> >increased the size of the language by over 70% during the last
> >fifty years (Fig. 2A).
> > Notably, we found more words than appear in any
> >dictionary. For instance, the 2002 Webster's Third New
> >International Dictionary [W3], which keeps track of the
> >contemporary American lexicon, lists approximately 348,000
> >single-word wordforms (10); the American Heritage
> >Dictionary of the English Language, Fourth Edition (AHD4)
> >lists 116,161 (11). (Both contain additional multi-word
> >entries.) Part of this gap is because dictionaries often exclude
> >proper nouns and compound words ("whalewatching"). Even
> >accounting for these factors, we found many undocumented
> >words, such as "aridification" (the process by which a
> >geographic region becomes dry), "slenthem" (a musical
> >instrument), and, appropriately, the word "deletable."
> > This gap between dictionaries and the lexicon results from
> >a balance that every dictionary must strike: it must be
> >comprehensive enough to be a useful reference, but concise
> >enough to be printed, shipped, and used. As such, many
> >infrequent words are omitted. To gauge how well dictionaries
> >reflect the lexicon, we ordered our year 2000 lexicon by
> >frequency, divided it into eight deciles (ranging from 10**-9 to
> >10**-8 up through 10**-2 to 10**-1), and sampled each decile (7). We manually
> >checked how many sample words were listed in the OED (12)
> >and in the Merriam-Webster Unabridged Dictionary [MWD].
> >(We excluded proper nouns, since neither OED nor MWD
> >lists them.) Both dictionaries had excellent coverage of high
> >frequency words, but less coverage for frequencies below 10**-6:
> >67% of words in the 10**-9 to 10**-8 range were listed in neither
> >dictionary (Fig. 2B). Consistent with Zipf's famous law, a
> >large fraction of the words in our lexicon (63%) were in this
> >lowest frequency bin. As a result, we estimated that 52% of
> >the English lexicon -- the majority of the words used in
> >English books -- consists of lexical "dark matter"
> >undocumented in standard references (12).>>
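To see how the headline figures hang together, here is a small Python
sketch. Only the numbers quoted above come from the article; the
per-bin values other than the lowest bin are hypothetical stand-ins,
chosen purely to illustrate the weighted-sum structure of the 52%
estimate:

    # Year-2000 lexicon estimate: common 1-grams minus non-words.
    common_1grams = 1_489_337        # from the article
    nonword_fraction = 0.31          # from the article (manual samples)
    lexicon = common_1grams * (1 - nonword_fraction)
    print(f"{lexicon:,.0f}")         # ~1,027,643; article reports ~1,022,000

    # Growth check, 1950 -> 2000 (both totals from the article):
    lex_1950, lex_2000 = 597_000, 1_022_000
    print((lex_2000 - lex_1950) / 50)                 # 8500.0 words/year
    print(f"{(lex_2000 - lex_1950) / lex_1950:.0%}")  # 71%, i.e. "over 70%"

    # "Dark matter" as a coverage-weighted sum over the eight bins.
    # Only the lowest bin is given in the text (63% of the lexicon,
    # 67% in neither dictionary); the other values are HYPOTHETICAL.
    bin_share = [0.63, 0.15, 0.08, 0.05, 0.04, 0.03, 0.01, 0.01]
    miss_rate = [0.67, 0.45, 0.25, 0.10, 0.04, 0.01, 0.00, 0.00]
    dark = sum(s * m for s, m in zip(bin_share, miss_rate))
    print(f"{dark:.0%}")             # 52% with these stand-in values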
> >
> >
> >         In the online supporting materials, the authors mention that
> >the dictionaries' performance was boosted appreciably by the inclusion
> >of Merriam-Webster's Medical Dictionary (they used the online versions
> >of MW and the OED). Presumably other specialized dictionaries would
> >have further aided performance.
> >
> >
> >John Baker
> >
> >
> >-----Original Message-----
> >From: American Dialect Society [mailto:ADS-L at LISTSERV.UGA.EDU] On
>Behalf
> >Of Ben Zimmer
> >Sent: Friday, December 17, 2010 10:15 AM
> >To: ADS-L at LISTSERV.UGA.EDU
> >Subject: Re: Linguistic dark matter
> >
> >On Fri, Dec 17, 2010 at 9:14 AM, Michael Quinion
> ><wordseditor at worldwidewords.org> wrote:
> > >
> > > David Barnhart wrote
> > >
> > > > If you haven't noticed I'm skeptical of the "tool".
> > >
> > > I'm certainly sceptical of that 52% "undocumented in standard
> > > references", which was why I quoted that sentence. The figure seems
> > > extremely high. As I can't get access to the Science article (which
> > > is available online only to paying subscribers), I can't begin to
> > > work out its basis.
> > >
> > > The researchers seem not to have applied many lexical filters.
> > > Proper names are included, because they want the corpus to be a
> > > cultural tool as well as a lexicographical one. Similarly, they
> > > allow scientific names ("Turdus merula" and the like). I would have
> > > thought that -- if the "standard references" are restricted to
> > > general dictionaries -- proper and scientific names would account
> > > for a big part of that missing 52%.
> >
> >To be fair, proper nouns were included in the researchers' overall
> >lexical count, but the "dark matter" is not 52% of that number. They
> >did filter out proper nouns for that part of the analysis, since they
> >were going for an apples-to-apples comparison with the OED and
> >Webster's Third. The media coverage doesn't get into these subtleties,
> >of course.
> >
> >--bgz
> >
> >--
> >Ben Zimmer
> >http://benzimmer.com/
> >
------------------------------------------------------------
The American Dialect Society - http://www.americandialect.org