Linguistic dark matter

Joel S. Berson Berson at ATT.NET
Fri Dec 17 14:41:11 UTC 2010


I forward a message from an 18th-century scholar on another list:
>My immediate sense was that my painstaking
>analysis of 200 novels available in three
>different databases could have been done with
>the click of a button. However I encountered
>immediate problems with this new Google search
>tool. On the first search I did, the first six
>entries for 1795 included 1) the complete works
>of Milton; 2) a collection of British poetry
>(focus on Spenser and Shakespeare), and a
>collection of historic British theatre; 3) two
>dictionaries; and, exactly one book actually
>published in 1795. The usual problems with the
>initial scanning and indexing of documents.

How many (more) errors of scholarship will Google
be siring with this new "tool"?

Joel

At 12/17/2010 08:30 AM, Jonathan Lighter wrote:
>Bad scans by Google must make up a fair number of those "dark" terms and
>undermine the authority of the graphs.
>
>A search for "crud," for example, shows that virtually all examples
>before the late thirties (allegedly) are bad scans of "cruel" and "crude."
>And that's just in English.
>
>JL
>
>On Fri, Dec 17, 2010 at 7:21 AM, Paul Frank <paulfrank at post.harvard.edu> wrote:
>
> > On Fri, 17 Dec 2010 11:13 +0000, "Michael Quinion"
> > <wordseditor at WORLDWIDEWORDS.ORG> wrote:
> >
> > >
> > > Science reports on a massive searchable corpus created from some five
> > > million books, now available on Google: http://ngrams.googlelabs.com/
> > >
> > > One report is here: http://bit.ly/ffQCmR . It quotes the researchers:
> > >
> > > "We estimated that 52% of the English lexicon - the majority of words
> > > used in English books - consist of lexical 'dark matter' undocumented in
> > > standard references."
> >
> > What's a standard reference? I bet that more than 90% of the technical
> > terms used in agrochemistry, analytical chemistry,
> > astrochemistry; acoustics, agrophysics and atomic physics; astrobiology,
> > astrochemistry, astrodynamics, astrometry, astrophysics; atmospheric
> > sciences; anatomy and astrobiology; automata theory, artificial
> > intelligence, algebraic computation; algebra, analysis, applied
> > mathematics, and so on down to the letter z, are not in the OED or in
> > any other single dictionary. And if you take all the technical terms in
> > the social sciences, the arts, and other branches of learning, I bet
> > it's closer to 99%. But that's okay. Tiki mug collectors don't need
> > English dictionaries to tell them what a tiki mug is. And the rest of us
> > can look it up in the Wikipedia
> > (http://en.wikipedia.org/wiki/Tiki_mugs), which is inching ever
> > closer to Borges' Library of Babel or the planet Memory Alpha, but will
> > never actually get there.
> >
> > Paul
> >
> > --
> >
> > Paul Frank
> > Translator
> > Chinese, German, French, Italian > English
> > Espace de l'Europe 16
> > Neuchâtel, Switzerland
> > paulfrank at bfs.admin.ch
> > paulfrank at post.harvard.edu
> >
> > ------------------------------------------------------------
> > The American Dialect Society - http://www.americandialect.org
> >
>
>
>
>--
>"If the truth is half as bad as I think it is, you can't handle the truth."
>
