Linguistic dark matter

Tom Zurinskas truespel at HOTMAIL.COM
Sat Dec 18 12:10:26 UTC 2010


Where can I find, in the n-gram database, the word frequency for the top 10k words in the most recent year?  That would be interesting to analyze with truespel phonetics.  I've done a 5k analysis of 15.4 million word hits in truespel book 4 for comparison.
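
If the raw 1-gram files can be downloaded, something like the sketch below would pull per-word frequencies for a single year and keep the top 10k. The file-name pattern, the target year, and the tab-separated column layout (ngram, year, match_count, ...) are assumptions about the public release, not checked facts, so adjust them to whatever the downloaded files actually contain.

# Sketch: estimate per-word frequency for one year from downloaded
# Google Books 1-gram files, then keep the 10,000 most frequent words.
# Assumes a tab-separated layout ngram<TAB>year<TAB>match_count<TAB>...
# (column positions may differ between dataset releases).
import gzip
import glob
import heapq

YEAR = 2008                      # most recent year in the corpus (assumed)
counts = {}                      # word -> occurrences in YEAR
total = 0                        # all 1-gram occurrences in YEAR

for path in glob.glob("googlebooks-eng-all-1gram-*.gz"):   # hypothetical file names
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f:
        for line in f:
            fields = line.rstrip("\n").split("\t")
            gram, year, match_count = fields[0], int(fields[1]), int(fields[2])
            if year != YEAR:
                continue
            total += match_count
            if gram.isalpha():                   # crude word filter
                w = gram.lower()
                counts[w] = counts.get(w, 0) + match_count

top10k = heapq.nlargest(10000, counts.items(), key=lambda kv: kv[1])
for word, n in top10k[:20]:
    print(word, n, n / total)                    # word, count, relative frequency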


Tom Zurinskas, USA - CT20, TN3, NJ33, FL7+
see truespel.com phonetic spelling





>
> ---------------------- Information from the mail header -----------------------
> Sender: American Dialect Society
> Poster: "Baker, John"
> Subject: Re: Linguistic dark matter
> -------------------------------------------------------------------------------
>
> Perhaps I should have included the following passage, which
> explains how they calculated usage frequency. Also, it should be
> understood that the number after a 10 (that is, the -9, -8, and so
> forth) is in superscript in the original; since ASCII flattens that to
> normal text, the exponents are written here with a caret, so "10^-5"
> means 10 to the power of -5. The authors apparently are using "decile"
> to mean "order of magnitude." Since the powers of 10 are negative, a
> usage frequency of 10^-8 marks a low-frequency word and 10^-2 a
> high-frequency word.
>
>
> <<Usage frequency is computed by dividing the number of instances of
> the n-gram in a given year by the total number of words in the corpus
> in that year. For instance, in 1861, the 1-gram "slavery" appeared in
> the corpus 21,460 times, on 11,687 pages of 1,208 books. The corpus
> contains 386,434,758 words from 1861; thus the frequency is
> 5.5x10^-5.>>
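
(A quick check of the arithmetic in that passage, using only the numbers quoted above:)

# Frequency of the 1-gram "slavery" in 1861, per the quoted passage.
matches_1861 = 21460            # occurrences of "slavery" in 1861
corpus_1861 = 386434758         # total words in the 1861 corpus
print(matches_1861 / float(corpus_1861))   # ~5.55e-05, i.e. 5.5 x 10^-5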
>
>
> John Baker
>
>
>
> -----Original Message-----
> From: American Dialect Society [mailto:ADS-L at LISTSERV.UGA.EDU] On Behalf
> Of Joel S. Berson
> Sent: Friday, December 17, 2010 2:07 PM
> To: ADS-L at LISTSERV.UGA.EDU
> Subject: Re: Linguistic dark matter
>
> 1) The researchers "estimate the number of English words", but then
> compare their corpus only with dictionaries of American English, which
> have 348,000 and 116,161 "single-word wordforms". How many does the OED
> have?
>
> 2) I once was a mathematician, but I find myself mightily confused
> by the "eight deciles (ranging from 10^-9 - 10^-8 to 10^-2 -
> 10^-1)". I've never seen such a notation for deciles, which are
> usually referred to as "the 90th percentile" or similarly. What does
> the notation in the "ranging ..." explanation mean? The "eight" I
> might understand as omitting two from their study, but which two, and
> why?
> And which end is up (high frequency) vs. down (low
> frequency)? They write "Both dictionaries had excellent coverage of
> high frequency words, but less coverage for frequencies below 10^-6:
> 67% of words in the 10^-9 - 10^-8 range were listed in neither
> dictionary (Fig. 2B)." "Frequencies below 10^-6" suggests that the
> smaller numbers are the lower frequencies, but referring to the "10^-9
> - 10^-8" range as the one with "less coverage" confuses me.
>
> Joel
>
> At 12/17/2010 11:24 AM, Baker, John wrote:
> > It seems extremely plausible to me that a high percentage of
> >words is undocumented in standard references. Many words are obscure
> >and rarely used; lexicographers rightly place reasonably high standards
> >on what words they will include in their dictionaries.
> >
> > Here is the relevant portion of the article:
> >
> >
> > <<We call a 1-gram [i.e., a string of characters uninterrupted by
> >a space, such as "banana" or "3.14159"] "common" if its frequency is
> >greater
> >than one per billion. (This corresponds to the frequency of the
> >words listed in leading dictionaries (7).) We compiled a list of
> >all common 1-grams in 1900, 1950, and 2000 based on the
> >frequency of each 1-gram in the preceding decade. These lists
> >contained 1,117,997 common 1-grams in 1900, 1,102,920 in
> >1950, and 1,489,337 in 2000.
> > Not all common 1-grams are English words. Many fell
> >into three non-word categories: (i) 1-grams with nonalphabetic
> >characters ("l8r", "3.14159"); (ii) misspellings
> >("becuase", "abberation"); and (iii) foreign words
> >("sensitivo").
> > To estimate the number of English words, we manually
> >annotated random samples from the lists of common 1-grams
> >(7) and determined what fraction were members of the above
> >non-word categories. The result ranged from 51% of all
> >common 1-grams in 1900 to 31% in 2000.
> > Using this technique, we estimated the number of words in
> >the English lexicon as 544,000 in 1900, 597,000 in 1950, and
> >1,022,000 in 2000. The lexicon is enjoying a period of
> >enormous growth: the addition of ~8500 words/year has
> >increased the size of the language by over 70% during the last
> >fifty years (Fig. 2A).
> > Notably, we found more words than appear in any
> >dictionary. For instance, the 2002 Webster's Third New
> >International Dictionary [W3], which keeps track of the
> >contemporary American lexicon, lists approximately 348,000
> >single-word wordforms (10); the American Heritage
> >Dictionary of the English Language, Fourth Edition (AHD4)
> >lists 116,161 (11). (Both contain additional multi-word
> >entries.) Part of this gap is because dictionaries often exclude
> >proper nouns and compound words ("whalewatching"). Even
> >accounting for these factors, we found many undocumented
> >words, such as "aridification" (the process by which a
> >geographic region becomes dry), "slenthem" (a musical
> >instrument), and, appropriately, the word "deletable."
> > This gap between dictionaries and the lexicon results from
> >a balance that every dictionary must strike: it must be
> >comprehensive enough to be a useful reference, but concise
> >enough to be printed, shipped, and used. As such, many
> >infrequent words are omitted. To gauge how well dictionaries
> >reflect the lexicon, we ordered our year 2000 lexicon by
> >frequency, divided it into eight deciles (ranging from 10^-9 - 10^-8
> >to 10^-2 - 10^-1), and sampled each decile (7). We manually
> >checked how many sample words were listed in the OED (12)
> >and in the Merriam-Webster Unabridged Dictionary [MWD].
> >(We excluded proper nouns, since neither OED nor MWD
> >lists them.) Both dictionaries had excellent coverage of high
> >frequency words, but less coverage for frequencies below 10^-6:
> >67% of words in the 10^-9 - 10^-8 range were listed in neither
> >dictionary (Fig. 2B). Consistent with Zipf's famous law, a
> >large fraction of the words in our lexicon (63%) were in this
> >lowest frequency bin. As a result, we estimated that 52% of
> >the English lexicon - the majority of the words used in
> >English books - consists of lexical "dark matter"
> >undocumented in standard references (12).>>
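
(A rough check of the lexicon-size estimates quoted above, assuming the estimate is simply the count of common 1-grams multiplied by the fraction judged to be real words; the excerpt gives only the 51% and 31% endpoints, so these come out close to, but not identical with, the paper's figures:)

# Approximate reconstruction of the quoted lexicon-size estimates.
common_1900 = 1117997        # common 1-grams in 1900 (from the excerpt)
common_2000 = 1489337        # common 1-grams in 2000
nonword_1900 = 0.51          # fraction of common 1-grams judged not to be words, 1900
nonword_2000 = 0.31          # same fraction for 2000
print(common_1900 * (1 - nonword_1900))   # ~548,000   (paper: 544,000)
print(common_2000 * (1 - nonword_2000))   # ~1,028,000 (paper: 1,022,000)
growth = 1022000 - 597000                 # growth in estimated words, 1950 -> 2000
print(growth / 50.0)                      # 8,500 words per year
print(growth / 597000.0)                  # ~0.71, i.e. over 70% in fifty years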
> >
> >
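
(And a minimal sketch of the binning step behind the "eight deciles": order a lexicon by relative frequency, group it into order-of-magnitude bins from 10^-9 - 10^-8 up through 10^-2 - 10^-1, sample each bin, and ask a dictionary lookup how many sampled words it lists. The lexicon and lookup below are toy stand-ins, not the authors' data or method verbatim:)

# Sketch of order-of-magnitude binning and dictionary-coverage sampling.
import math
import random

def coverage_by_bin(lexicon, in_dictionary, sample_size=100, seed=0):
    rng = random.Random(seed)
    bins = {}                                   # exponent -> list of words
    for word, freq in lexicon.items():
        exp = math.floor(math.log10(freq))      # 3e-9 falls in the 10^-9 - 10^-8 bin (exp = -9)
        bins.setdefault(exp, []).append(word)
    results = {}
    for exp, words in sorted(bins.items()):
        sample = rng.sample(words, min(sample_size, len(words)))
        covered = sum(1 for w in sample if in_dictionary(w))
        results[exp] = covered / float(len(sample))
    return results

# Purely illustrative toy data:
toy_lexicon = {"the": 5e-2, "slavery": 5.5e-5, "aridification": 3e-9, "slenthem": 8e-9}
toy_dict = {"the", "slavery"}
print(coverage_by_bin(toy_lexicon, toy_dict.__contains__, sample_size=2))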
> > In the online supporting materials, the authors mention that the
> >dictionaries' performance was boosted appreciably by the inclusion of
> >Merriam-Webster's Medical Dictionary (they used the online versions of
> >MW and the OED). Presumably other specialized dictionaries would have
> >further aided performance.
> >
> >
> >John Baker
> >
> >
> >-----Original Message-----
> >From: American Dialect Society [mailto:ADS-L at LISTSERV.UGA.EDU] On Behalf
> >Of Ben Zimmer
> >Sent: Friday, December 17, 2010 10:15 AM
> >To: ADS-L at LISTSERV.UGA.EDU
> >Subject: Re: Linguistic dark matter
> >
> >On Fri, Dec 17, 2010 at 9:14 AM, Michael Quinion wrote:
> > >
> > > David Barnhart wrote
> > >
> > > > If you haven't noticed I'm skeptical of the "tool".
> > >
> > > I'm certainly sceptical of that 52% "undocumented in standard
> > > references", which was why I quoted that sentence. The figure seems
> > > extremely high. As I can't get access to the Science article (which
> > > is free online only to subscribers), I can't begin to work out its
> > > basis.
> > >
> > > The researchers seem not to have applied many lexical filters. Proper
> > > names are included, because they want the corpus to be a cultural tool
> > > as well as a lexicographical one. Similarly, they allow scientific names
> > > ("Turdus merula" and the like). I would have thought that - if the
> > > "standard references" are restricted to general dictionaries - proper
> > > and scientific names would account for a big part of that missing 52%.
> >
> >To be fair, proper nouns were included in the researchers' overall
> >lexical count, but the "dark matter" is not 52% of that number. They
> >did filter out proper nouns for that part of the analysis, since they
> >were going for an apples-to-apples comparison with the OED and
> >Webster's Third. The media coverage doesn't get into these subtleties,
> >of course.
> >
> >--bgz
> >
> >--
> >Ben Zimmer
> >http://benzimmer.com/
> >

------------------------------------------------------------
The American Dialect Society - http://www.americandialect.org
