Linguistic dark matter

Jonathan Lighter wuxxmupp2000 at GMAIL.COM
Sat Dec 18 16:35:56 UTC 2010


The more I think of it, the more it seems like just another publicity stunt
carried out by people who don't much care that they don't know what they're
doing.

And if the results are meaningless, it doesn't matter, because you can still
listen to your mp3s.




JL

On Sat, Dec 18, 2010 at 7:10 AM, Tom Zurinskas <truespel at hotmail.com> wrote:

> ---------------------- Information from the mail header
> -----------------------
> Sender:       American Dialect Society <ADS-L at LISTSERV.UGA.EDU>
> Poster:       Tom Zurinskas <truespel at HOTMAIL.COM>
> Subject:      Re: Linguistic dark matter
>
> -------------------------------------------------------------------------------
>
> Where can I find in the n-gram database for the most recent year the word
> frequency for the top 10k words?  That would be interesting to analyze with
> truespel phonetics.  I've done a 5k analysis of 15.4 million word hits in
> truespel book 4 for comparison.
>
>
> Tom Zurinskas, USA - CT20, TN3, NJ33, FL7+
> see truespel.com phonetic spelling
>
>
>
>
>
> >
> > ---------------------- Information from the mail header
> -----------------------
> > Sender: American Dialect Society
> > Poster: "Baker, John"
> > Subject: Re: Linguistic dark matter
> >
> -------------------------------------------------------------------------------
> >
> > Perhaps I should have included the following passage, which
> > explains how they calculated usage frequency. Also, it should be
> > understood that the number after a 10 (that is, the -9, -8, and so
> > forth) is in superscript in the original, but my ASCII post converted
> > this to normal text. In the following passage, therefore, "10-5" is
> > really 10 to the power of -5. The authors apparently are using "decile"
> > to mean "order of magnitude." Since the powers of 10 are negative
> > powers, a usage frequency of 10-8 would be for a low-frequency word and
> > 10-2 would be for a high-frequency word.
> >
> >
> > <> instances of the n-gram in a given year by the total number of words in
> > the corpus in that year. For instance, in 1861, the 1-gram "slavery"
> > appeared in the corpus 21,460 times, on 11,687 pages of 1,208 books.
> > The corpus contains 386,434,758 words from 1861; thus the frequency is
> > 5.5x10-5.>>
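
The arithmetic in the passage above can be checked with a short sketch. The function names here (usage_frequency, magnitude_bin) are made up for illustration and are not from the article; the only inputs are the two 1861 figures quoted above. The ratio comes out at roughly 5.55 x 10^-5, which the article gives as 5.5 x 10^-5, and it falls in the 10^-5 to 10^-4 order-of-magnitude bin, so larger (less negative) exponents mean more frequent words.

```python
import math

def usage_frequency(ngram_count, corpus_words):
    """Occurrences of the n-gram divided by total words in that year's corpus."""
    return ngram_count / corpus_words

def magnitude_bin(freq):
    """Return exponents (lo, hi) such that 10**lo <= freq < 10**hi."""
    lo = math.floor(math.log10(freq))
    return lo, lo + 1

# Figures quoted for the 1-gram "slavery" in 1861.
freq = usage_frequency(21_460, 386_434_758)
lo, hi = magnitude_bin(freq)

print(f"frequency = {freq:.2e}")
print(f"bin       = 10^{lo} to 10^{hi}")
```

On this reading, the article's "10-9 - 10-8" bin holds the rarest words and "10-2 - 10-1" the most common ones.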
> >
> >
> > John Baker
> >
> >
> >
> > -----Original Message-----
> > From: American Dialect Society [mailto:ADS-L at LISTSERV.UGA.EDU] On Behalf
> > Of Joel S. Berson
> > Sent: Friday, December 17, 2010 2:07 PM
> > To: ADS-L at LISTSERV.UGA.EDU
> > Subject: Re: Linguistic dark matter
> >
> > 1) The researchers "estimate the number of English words", but then
> > compare their corpus only with dictionaries of American, having
> > 348,000 and 116,161 "single-word wordforms". How many does the OED
> > have?
> >
> > 2) I once was a mathematician, but I find myself mightily confused
> > by the "eight deciles (ranging from 10-9 - 10-8 to 10-2 -
> > 10-1)". I've never seen such a notation for deciles, which are
> > usually referred to as "the 90th percentile" or similarly. What does
> > the notation in the "ranging ..." explanation mean? The "eight" I
> > might understand as omitting two from their study, but which two, and
> > why?
> > And which end is up (high frequency) vs. down (low
> > frequency)? They write "Both dictionaries had excellent coverage of
> > high frequency words, but less coverage for frequencies below 10- 6:
> > 67% of words in the 10-9 - 10-8 range were listed in neither
> > dictionary (Fig. 2B)." "Frequencies below 10- 6" suggests that the
> > smaller numbers are the lower frequencies, but to refer to the "10-9
> > - 10-8 range" as the "but less coverage" confuses me.
> >
> > Joel
> >
> > At 12/17/2010 11:24 AM, Baker, John wrote:
> > > It seems extremely plausible to me that a high percentage of
> > >words is undocumented in standard references. Many words are obscure
> > >and rarely used; lexicographers rightly place reasonably high standards
> > >on what words they will include in their dictionaries.
> > >
> > > Here is the relevant portion of the article:
> > >
> > >
> > ><<We call a 1-gram [i.e., a string of characters uninterrupted by
> > >a space, such as "banana" or "3.14159"] "common" if its frequency is
> > >greater than one per billion. (This corresponds to the frequency of the
> > >words listed in leading dictionaries (7).) We compiled a list of
> > >all common 1-grams in 1900, 1950, and 2000 based on the
> > >frequency of each 1-gram in the preceding decade. These lists
> > >contained 1,117,997 common 1-grams in 1900, 1,102,920 in
> > >1950, and 1,489,337 in 2000.
> > > Not all common 1-grams are English words. Many fell
> > >into three non-word categories: (i) 1-grams with nonalphabetic
> > >characters ("l8r", "3.14159"); (ii) misspellings
> > >("becuase", "abberation"); and (iii) foreign words
> > >("sensitivo").
> > > To estimate the number of English words, we manually
> > >annotated random samples from the lists of common 1-grams
> > >(7) and determined what fraction were members of the above
> > >non-word categories. The result ranged from 51% of all
> > >common 1-grams in 1900 to 31% in 2000.
> > > Using this technique, we estimated the number of words in
> > >the English lexicon as 544,000 in 1900, 597,000 in 1950, and
> > >1,022,000 in 2000. The lexicon is enjoying a period of
> > >enormous growth: the addition of ~8500 words/year has
> > >increased the size of the language by over 70% during the last
> > >fifty years (Fig. 2A).
> > > Notably, we found more words than appear in any
> > >dictionary. For instance, the 2002 Webster's Third New
> > >International Dictionary [W3], which keeps track of the
> > >contemporary American lexicon, lists approximately 348,000
> > >single-word wordforms (10); the American Heritage
> > >Dictionary of the English Language, Fourth Edition (AHD4)
> > >lists 116,161 (11). (Both contain additional multi-word
> > >entries.) Part of this gap is because dictionaries often exclude
> > >proper nouns and compound words ("whalewatching"). Even
> > >accounting for these factors, we found many undocumented
> > >words, such as "aridification" (the process by which a
> > >geographic region becomes dry), "slenthem" (a musical
> > >instrument), and, appropriately, the word "deletable."
> > > This gap between dictionaries and the lexicon results from
> > >a balance that every dictionary must strike: it must be
> > >comprehensive enough to be a useful reference, but concise
> > >enough to be printed, shipped, and used. As such, many
> > >infrequent words are omitted. To gauge how well dictionaries
> > >reflect the lexicon, we ordered our year 2000 lexicon by
> > >frequency, divided it into eight deciles (ranging from 10-9 -
> > >10-8 to 10-2 - 10-1), and sampled each decile (7). We manually
> > >checked how many sample words were listed in the OED (12)
> > >and in the Merriam-Webster Unabridged Dictionary [MWD].
> > >(We excluded proper nouns, since neither OED nor MWD
> > >lists them.) Both dictionaries had excellent coverage of high
> > >frequency words, but less coverage for frequencies below 10-
> > >6: 67% of words in the 10-9 - 10-8 range were listed in neither
> > >dictionary (Fig. 2B). Consistent with Zipf's famous law, a
> > >large fraction of the words in our lexicon (63%) were in this
> > >lowest frequency bin. As a result, we estimated that 52% of
> > >the English lexicon - the majority of the words used in
> > >English books - consists of lexical "dark matter"
> > >undocumented in standard references (12).>>
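
The growth figures in the quoted passage (about 8500 words/year, and "over 70%" over fifty years) can be reproduced from the three lexicon estimates it gives. This is only a consistency check on the quoted numbers, not the authors' method, and the variable names are illustrative.

```python
# Lexicon size estimates quoted in the passage above.
lexicon = {1900: 544_000, 1950: 597_000, 2000: 1_022_000}

added = lexicon[2000] - lexicon[1950]   # words added between 1950 and 2000
per_year = added / (2000 - 1950)        # average yearly additions
growth = added / lexicon[1950]          # relative growth over the fifty years

print(f"added 1950-2000: {added:,}")
print(f"per year:        {per_year:,.0f}")
print(f"growth:          {growth:.1%}")
```

Both derived values agree with the passage: 425,000 added words is 8,500 per year, and a little over 71% of the 1950 estimate.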
> > >
> > >
> > > In the online supporting materials, the authors mention that the
> > >dictionaries' performance was boosted appreciably by the inclusion of
> > >Merriam-Webster's Medical Dictionary (they used the online versions of
> > >MW and the OED). Presumably other specialized dictionaries would have
> > >further aided performance.
> > >
> > >
> > >John Baker
> > >
> > >
> > >-----Original Message-----
> > >From: American Dialect Society [mailto:ADS-L at LISTSERV.UGA.EDU] On Behalf
> > >Of Ben Zimmer
> > >Sent: Friday, December 17, 2010 10:15 AM
> > >To: ADS-L at LISTSERV.UGA.EDU
> > >Subject: Re: Linguistic dark matter
> > >
> > >On Fri, Dec 17, 2010 at 9:14 AM, Michael Quinion
> > > wrote:
> > > >
> > > > David Barnhart wrote
> > > >
> > > > > If you haven't noticed I'm skeptical of the "tool".
> > > >
> > > > I'm certainly sceptical of that 52% "undocumented in standard
> > > > references", which was why I quoted that sentence. The figure seems
> > > > extremely high. As I can't get access to the Science article (which
> > > > is only free online to subscribers), I can't begin to work out its
> > > > basis.
> > > >
> > > > The researchers seem not to have applied many lexical filters.
> > > > Proper names are included, because they want the corpus to be a
> > > > cultural tool as well as a lexicographical one. Similarly, they
> > > > allow scientific names ("Turdus merula" and the like). I would have
> > > > thought that - if the "standard references" are restricted to
> > > > general dictionaries - proper and scientific names would account for
> > > > a big part of that missing 52%.
> > >
> > >To be fair, proper nouns were included in the researchers' overall
> > >lexical count, but the "dark matter" is not 52% of that number. They
> > >did filter out proper nouns for that part of the analysis, since they
> > >were going for an apples-to-apples comparison with the OED and
> > >Webster's Third. The media coverage doesn't get into these subtleties,
> > >of course.
> > >
> > >--bgz
> > >
> > >--
> > >Ben Zimmer
> > >http://benzimmer.com/
> > >
> > >------------------------------------------------------------
>  > >The American Dialect Society - http://www.americandialect.org



--
"If the truth is half as bad as I think it is, you can't handle the truth."

------------------------------------------------------------
The American Dialect Society - http://www.americandialect.org


