Linguistic dark matter - obsolete words

Tom Zurinskas truespel at HOTMAIL.COM
Sat Dec 18 11:55:45 UTC 2010


Is there a listing of obsolete words?  Who decides?  Take the term "autoist."

http://headsuptheblog.blogspot.com/2010/12/today-in-journalism-history-words-words.html

There's an interesting abbreev in the above article.  It's "hed", short for "headline".  It's a respelled abbreev.  It's a new word.


Tom Zurinskas, USA - CT20, TN3, NJ33, FL7+
see truespel.com phonetic spelling




> ---------------------- Information from the mail header -----------------------
> Sender: American Dialect Society
> Poster: Jonathan Lighter
> Subject: Re: Linguistic dark matter
> -------------------------------------------------------------------------------
>
> And how many of those "dark" words are unnaturalized foreign words in
> English contexts?
>
> Of course, I imagine their definition of "word" is airtight. Frequent
> two-word compounds, for example, many of which require no dictionary entry,
> can be multiplied almost ad lib. Probably the original article
> addresses such issues, but at this point I'd be less than honest if I said I
> cared.
>
> JL
>
> On Fri, Dec 17, 2010 at 11:24 AM, Baker, John wrote:
>
> > ---------------------- Information from the mail header -----------------------
> > Sender: American Dialect Society
> > Poster: "Baker, John"
> > Subject: Re: Linguistic dark matter
> > -------------------------------------------------------------------------------
> >
> > It seems extremely plausible to me that a high percentage of
> > words is undocumented in standard references. Many words are obscure
> > and rarely used; lexicographers rightly set reasonably high standards
> > for which words they will include in their dictionaries.
> >
> > Here is the relevant portion of the article:
> >
> >
> > <<We call a 1-gram [i.e., a string of characters uninterrupted by
> > a space, such as "banana" or "3.14159"] "common" if its frequency is
> > greater
> > than one per billion. (This corresponds to the frequency of the
> > words listed in leading dictionaries (7).) We compiled a list of
> > all common 1-grams in 1900, 1950, and 2000 based on the
> > frequency of each 1-gram in the preceding decade. These lists
> > contained 1,117,997 common 1-grams in 1900, 1,102,920 in
> > 1950, and 1,489,337 in 2000.
> > Not all common 1-grams are English words. Many fell
> > into three non-word categories: (i) 1-grams with nonalphabetic
> > characters ("l8r", "3.14159"); (ii) misspellings
> > ("becuase, "abberation"); and (iii) foreign words
> > ("sensitivo").
> > To estimate the number of English words, we manually
> > annotated random samples from the lists of common 1-grams
> > (7) and determined what fraction were members of the above
> > non-word categories. The result ranged from 51% of all
> > common 1-grams in 1900 to 31% in 2000.
> > Using this technique, we estimated the number of words in
> > the English lexicon as 544,000 in 1900, 597,000 in 1950, and
> > 1,022,000 in 2000. The lexicon is enjoying a period of
> > enormous growth: the addition of ~8500 words/year has
> > increased the size of the language by over 70% during the last
> > fifty years (Fig. 2A).
> > Notably, we found more words than appear in any
> > dictionary. For instance, the 2002 Webster's Third New
> > International Dictionary [W3], which keeps track of the
> > contemporary American lexicon, lists approximately 348,000
> > single-word wordforms (10); the American Heritage
> > Dictionary of the English Language, Fourth Edition (AHD4)
> > lists 116,161 (11). (Both contain additional multi-word
> > entries.) Part of this gap is because dictionaries often exclude
> > proper nouns and compound words ("whalewatching"). Even
> > accounting for these factors, we found many undocumented
> > words, such as "aridification" (the process by which a
> > geographic region becomes dry), "slenthem" (a musical
> > instrument), and, appropriately, the word "deletable."
> > This gap between dictionaries and the lexicon results from
> > a balance that every dictionary must strike: it must be
> > comprehensive enough to be a useful reference, but concise
> > enough to be printed, shipped, and used. As such, many
> > infrequent words are omitted. To gauge how well dictionaries
> > reflect the lexicon, we ordered our year 2000 lexicon by
> > frequency, divided it into eight deciles (ranging from 10^-9 - 10^-8
> > to 10^-2 - 10^-1), and sampled each decile (7). We manually
> > checked how many sample words were listed in the OED (12)
> > and in the Merriam-Webster Unabridged Dictionary [MWD].
> > (We excluded proper nouns, since neither OED nor MWD
> > lists them.) Both dictionaries had excellent coverage of high
> > frequency words, but less coverage for frequencies below 10^-6:
> > 67% of words in the 10^-9 - 10^-8 range were listed in neither
> > dictionary (Fig. 2B). Consistent with Zipf's famous law, a
> > large fraction of the words in our lexicon (63%) were in this
> > lowest frequency bin. As a result, we estimated that 52% of
> > the English lexicon - the majority of the words used in
> > English books - consists of lexical "dark matter"
> > undocumented in standard references (12).>>
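> >
> > To make that arithmetic concrete, here is a minimal sketch in Python of
> > the kind of bookkeeping the excerpt describes. It is not the authors'
> > code: the input file name, its tab-separated "token, count" layout, and
> > the plugged-in figures are assumptions for illustration only.
> >
> > import csv
> >
> > COMMON_THRESHOLD = 1e-9  # "common" = frequency above one per billion
> >
> > def load_frequencies(path, total_tokens):
> >     """Read 1-gram counts for one decade; convert to relative frequencies."""
> >     freqs = {}
> >     with open(path, newline="", encoding="utf-8") as f:
> >         for token, count in csv.reader(f, delimiter="\t"):
> >             freqs[token] = int(count) / total_tokens
> >     return freqs
> >
> > def common_onegrams(freqs):
> >     """Keep only the 1-grams above the one-per-billion threshold."""
> >     return {t for t, f in freqs.items() if f > COMMON_THRESHOLD}
> >
> > def estimate_lexicon_size(n_common, nonword_fraction):
> >     """Scale the common-1-gram count by the manually annotated share of
> >     non-words (numbers, misspellings, foreign words)."""
> >     return round(n_common * (1 - nonword_fraction))
> >
> > # The paper's year-2000 figures (1,489,337 common 1-grams, ~31% judged
> > # non-words in the manual sample) land near its ~1,022,000-word estimate:
> > print(estimate_lexicon_size(1_489_337, 0.31))  # 1027643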
> >
> >
> > In the online supporting materials, the authors mention that the
> > dictionaries' performance was boosted appreciably by the inclusion of
> > Merriam-Webster's Medical Dictionary (they used the online versions of
> > MW and the OED). Presumably other specialized dictionaries would have
> > further aided performance.
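> >
> > For anyone wanting to replay that coverage check, a rough sketch along
> > the lines the excerpt describes. The inputs (a word-to-frequency map with
> > proper nouns already removed, and a merged OED/MWD headword set) are
> > assumptions here, not the paper's actual data or sampling code.
> >
> > import math
> > import random
> >
> > def coverage_by_band(lexicon_freqs, headwords, sample_size=100):
> >     """For each order-of-magnitude frequency band, sample words and report
> >     the share found in the combined dictionary headword set."""
> >     bands = {}
> >     for word, freq in lexicon_freqs.items():
> >         low = 10.0 ** math.floor(math.log10(freq))  # e.g. 3e-9 -> the 1e-9 band
> >         bands.setdefault(low, []).append(word)
> >     report = {}
> >     for low in sorted(bands):  # lowest-frequency band first
> >         sample = random.sample(bands[low], min(sample_size, len(bands[low])))
> >         listed = sum(w in headwords for w in sample)
> >         report[f"{low:.0e} - {low * 10:.0e}"] = listed / len(sample)
> >     return report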
> >
> >
> > John Baker
> >
> >
> > -----Original Message-----
> > From: American Dialect Society [mailto:ADS-L at LISTSERV.UGA.EDU] On Behalf
> > Of Ben Zimmer
> > Sent: Friday, December 17, 2010 10:15 AM
> > To: ADS-L at LISTSERV.UGA.EDU
> > Subject: Re: Linguistic dark matter
> >
> > On Fri, Dec 17, 2010 at 9:14 AM, Michael Quinion wrote:
> > >
> > > David Barnhart wrote
> > >
> > > > If you haven't noticed, I'm skeptical of the "tool".
> > >
> > > I'm certainly sceptical of that 52% "undocumented in standard
> > > references", which was why I quoted that sentence. The figure seems
> > > extremely high. As I can't get access to the Science article (which is
> > > available online only to subscribers), I can't begin to work out its basis.
> > >
> > > The researchers seem not to have applied many lexical filters. Proper
> > > names are included, because they want the corpus to be a cultural tool
> > > as well as a lexicographical one. Similarly, they allow scientific names
> > > ("Turdus merula" and the like). I would have thought that - if the
> > > "standard references" are restricted to general dictionaries - proper
> > > and scientific names would account for a big part of that missing 52%.
> >
> > To be fair, proper nouns were included in the researchers' overall
> > lexical count, but the "dark matter" is not 52% of that number. They
> > did filter out proper nouns for that part of the analysis, since they
> > were going for an apples-to-apples comparison with the OED and
> > Webster's Third. The media coverage doesn't get into these subtleties,
> > of course.
> >
> > --bgz
> >
> > --
> > Ben Zimmer
> > http://benzimmer.com/
> >
> >
>
>
>
> --
> "If the truth is half as bad as I think it is, you can't handle the truth."
>

------------------------------------------------------------
The American Dialect Society - http://www.americandialect.org


