Linguistic dark matter

Dan Goncharoff thegonch at GMAIL.COM
Fri Dec 17 19:43:56 UTC 2010


What they wrote was this:

"To gauge how well dictionaries reflect the lexicon, we ordered our
year 2000 lexicon by frequency, divided it into eight deciles (ranging
from 10-9 - 10-8 to 10-2 - 10-1), and sampled each decile."

I can't get past "divided it into eight deciles". Your attempt to give
them the benefit of the doubt is admirable, but unpersuasive.
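
That said, here is roughly what I assume they were doing (a sketch of
my own in Python, not anything taken from the paper, and the word list
is invented): binning the lexicon by order of magnitude of frequency,
so the 10^-9 - 10^-8 bin holds the rarest words (about one occurrence
per billion tokens) and 10^-2 - 10^-1 the most common, then sampling
each bin for manual checking against the dictionaries.

import math
import random

def freq_bin(freq):
    # Map a relative frequency (occurrences per token) to one of eight
    # order-of-magnitude bins: [10^-9, 10^-8), ..., [10^-2, 10^-1).
    exp = math.floor(math.log10(freq))
    if -9 <= exp <= -2:
        return (10.0 ** exp, 10.0 ** (exp + 1))
    return None  # rarer than 10^-9 or commoner than 10^-1

# Toy lexicon of {wordform: relative frequency}; values are invented
# purely for illustration.
lexicon = {"the": 5e-2, "banana": 3e-6, "slenthem": 2e-9}

bins = {}
for word, freq in lexicon.items():
    rng = freq_bin(freq)
    if rng is not None:
        bins.setdefault(rng, []).append(word)

# Sample up to ten words per bin for manual dictionary checking.
for (lo, hi), words in sorted(bins.items()):
    sample = random.sample(words, min(10, len(words)))
    print("%.0e - %.0e: %d words, e.g. %s" % (lo, hi, len(words), sample))

Whatever you call those eight bins, they are not deciles.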

DanG

On Fri, Dec 17, 2010 at 2:07 PM, Joel S. Berson <Berson at att.net> wrote:
> ---------------------- Information from the mail header -----------------------
> Sender:       American Dialect Society <ADS-L at LISTSERV.UGA.EDU>
> Poster:       "Joel S. Berson" <Berson at ATT.NET>
> Subject:      Re: Linguistic dark matter
> -------------------------------------------------------------------------------
>
> 1)  The researchers "estimate the number of English words", but then
> compare their corpus only with dictionaries of American English, having
> 348,000 and 116,161 "single-word wordforms".  How many does the OED have?
>
> 2)   I once was a mathematician, but I find myself mightily confused
> by the "eight deciles (ranging from 10^-9 - 10^-8 to 10^-2 -
> 10^-1)".  I've never seen such a notation for deciles, which are
> usually referred to as "the 90th percentile" or similarly.  What does
> the notation in the "ranging ..." explanation mean?  The "eight" I
> might understand as omitting two from their study, but which two, and why?
>      And which end is up (high frequency) vs. down (low
> frequency)?  They write "Both dictionaries had excellent coverage of
> high frequency words, but less coverage for frequencies below 10^-6:
> 67% of words in the 10^-9 - 10^-8 range were listed in neither
> dictionary (Fig. 2B)."  "Frequencies below 10^-6" suggests that the
> smaller numbers are the lower frequencies, but to refer to the "10^-9
> - 10^-8 range" as the one with "less coverage" confuses me.
>
> Joel
>
> At 12/17/2010 11:24 AM, Baker, John wrote:
>>         It seems extremely plausible to me that a high percentage of
>>words is undocumented in standard references.  Many words are obscure
>>and rarely used; lexicographers rightly place reasonably high standards
>>on what words they will include in their dictionaries.
>>
>>         Here is the relevant portion of the article:
>>
>>
>>         <<How many words are in the English language (9)?
>>         We call a 1-gram [i.e., a string of characters uninterrupted by
>>a space, such as "banana" or "3.14159"] "common" if its frequency is
>>greater
>>than one per billion. (This corresponds to the frequency of the
>>words listed in leading dictionaries (7).) We compiled a list of
>>all common 1-grams in 1900, 1950, and 2000 based on the
>>frequency of each 1-gram in the preceding decade. These lists
>>contained 1,117,997 common 1-grams in 1900, 1,102,920 in
>>1950, and 1,489,337 in 2000.
>>         Not all common 1-grams are English words. Many fell
>>into three non-word categories: (i) 1-grams with nonalphabetic
>>characters ("l8r", "3.14159"); (ii) misspellings
>>("becuase, "abberation"); and (iii) foreign words
>>("sensitivo").
>>         To estimate the number of English words, we manually
>>annotated random samples from the lists of common 1-grams
>>(7) and determined what fraction were members of the above
>>non-word categories. The result ranged from 51% of all
>>common 1-grams in 1900 to 31% in 2000.
>>         Using this technique, we estimated the number of words in
>>the English lexicon as 544,000 in 1900, 597,000 in 1950, and
>>1,022,000 in 2000. The lexicon is enjoying a period of
>>enormous growth: the addition of ~8500 words/year has
>>increased the size of the language by over 70% during the last
>>fifty years (Fig. 2A).
>>         Notably, we found more words than appear in any
>>dictionary. For instance, the 2002 Webster's Third New
>>International Dictionary [W3], which keeps track of the
>>contemporary American lexicon, lists approximately 348,000
>>single-word wordforms (10); the American Heritage
>>Dictionary of the English Language, Fourth Edition (AHD4)
>>lists 116,161 (11). (Both contain additional multi-word
>>entries.) Part of this gap is because dictionaries often exclude
>>proper nouns and compound words ("whalewatching"). Even
>>accounting for these factors, we found many undocumented
>>words, such as "aridification" (the process by which a
>>geographic region becomes dry), "slenthem" (a musical
>>instrument), and, appropriately, the word "deletable."
>>         This gap between dictionaries and the lexicon results from
>>a balance that every dictionary must strike: it must be
>>comprehensive enough to be a useful reference, but concise
>>enough to be printed, shipped, and used. As such, many
>>infrequent words are omitted. To gauge how well dictionaries
>>reflect the lexicon, we ordered our year 2000 lexicon by
>>frequency, divided it into eight deciles (ranging from 10^-9 -
>>10^-8 to 10^-2 - 10^-1), and sampled each decile (7). We manually
>>checked how many sample words were listed in the OED (12)
>>and in the Merriam-Webster Unabridged Dictionary [MWD].
>>(We excluded proper nouns, since neither OED nor MWD
>>lists them.) Both dictionaries had excellent coverage of high
>>frequency words, but less coverage for frequencies below 10^-6:
>>67% of words in the 10^-9 - 10^-8 range were listed in neither
>>dictionary (Fig. 2B). Consistent with Zipf's famous law, a
>>large fraction of the words in our lexicon (63%) were in this
>>lowest frequency bin. As a result, we estimated that 52% of
>>the English lexicon - the majority of the words used in
>>English books - consists of lexical "dark matter"
>>undocumented in standard references (12).>>
>>
>>
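>>         As a back-of-the-envelope check of their arithmetic (my own,
>>not from the paper): 1,489,337 common 1-grams in 2000 times the 69%
>>judged to be real words comes to roughly 1,028,000, in line with their
>>1,022,000 estimate, and 1,117,997 times 49% for 1900 comes to roughly
>>548,000, close to their 544,000.
>>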
>>         In the online supporting materials, the authors mention that the
>>dictionaries' performance was boosted appreciably by the inclusion of
>>Merriam-Webster's Medical Dictionary (they used the online versions of
>>MW and the OED).  Presumably other specialized dictionaries would have
>>further aided performance.
>>
>>
>>John Baker
>>
>>
>>-----Original Message-----
>>From: American Dialect Society [mailto:ADS-L at LISTSERV.UGA.EDU] On Behalf
>>Of Ben Zimmer
>>Sent: Friday, December 17, 2010 10:15 AM
>>To: ADS-L at LISTSERV.UGA.EDU
>>Subject: Re: Linguistic dark matter
>>
>>On Fri, Dec 17, 2010 at 9:14 AM, Michael Quinion
>><wordseditor at worldwidewords.org> wrote:
>> >
>> > David Barnhart wrote
>> >
>> > > If you haven't noticed I'm skeptical of the "tool".
>> >
>> > I'm certainly sceptical of that 52% "undocumented in standard
>> > references", which was why I quoted that sentence. The figure seems
>> > extremely high. As I can't get access to the Science article (which is
>> > only free online to subscribers), I can't begin to work out its basis.
>> >
>> > The researchers seem not to have applied many lexical filters. Proper
>> > names are included, because they want the corpus to be a cultural tool
>> > as well as a lexicographical one. Similarly, they allow scientific names
>> > ("Turdus merula" and the like). I would have thought that - if the
>> > "standard references" are restricted to general dictionaries - proper
>> > and scientific names would account for a big part of that missing 52%.
>>
>>To be fair, proper nouns were included in the researchers' overall
>>lexical count, but the "dark matter" is not 52% of that number. They
>>did filter out proper nouns for that part of the analysis, since they
>>were going for an apples-to-apples comparison with the OED and
>>Webster's Third. The media coverage doesn't get into these subtleties,
>>of course.
>>
>>--bgz
>>
>>--
>>Ben Zimmer
>>http://benzimmer.com/
>>
>

------------------------------------------------------------
The American Dialect Society - http://www.americandialect.org


