[Ads-l] FW: NGram vs. the OED
Dave Wilton
dave at WILTON.NET
Fri May 8 11:08:00 UTC 2015
I consider the Ngram tool to be an excellent place to initially test a hypothesis and for "quick and dirty" analysis. I wouldn't base a published article on Ngram data.
-----Original Message-----
From: American Dialect Society [mailto:ADS-L at LISTSERV.UGA.EDU] On Behalf Of sclements at NEO.RR.COM
Sent: Thursday, May 07, 2015 9:29 PM
To: ADS-L at LISTSERV.UGA.EDU
Subject: Re: FW: NGram vs. the OED
So, the Ngram is essentially a currently flawed tool that is mostly useless. Is that what I should take away from Ben's post?
Sam Clements
---- Ben Zimmer <bgzimmer at GMAIL.COM> wrote:
> The current dataset for the Ngram corpus goes through 2012 (the
> original one went through 2009). The 2012 version is described in this
> paper:
>
> http://aclweb.org/anthology/P/P12/P12-3029.pdf
> "In this work we provide a new edition of the Google Books Ngram
> Corpus that contains over 8 million books, or 6% of all books ever
> published."
>
> That's a small subset of the total number of volumes scanned and
> digitized as part of Google Books (currently containing over 30
> million books).
>
>
> On Thu, May 7, 2015 at 1:31 PM, ADSGarson O'Toole
> <adsgarsonotoole at gmail.com> wrote:
> >
> >
> > The Ngram database was constructed using a subset of the Google
> > Books database. Some books used for citations in the OED are not in
> > GB (I assume). The Wikipedia article for "Google Ngram Viewer" asserts:
> >
> > [Begin excerpt]
> > It was developed by Jon Orwant and Will Brockman and released in
> > mid-December 2010. . . .
> > Google populated the database from over 5 million books published up to 2008.
> > [End excerpt]
> >
> > It is possible that the Ngram database has not been updated after
> > 2010. If this is true then books digitized after 2011 would be absent.
> >
> > OCR quality is sometimes poor for older works. Also, I still see
> > metadata errors with regularity.
> >
> > Google Books does currently contain some instances of "Gentleman
> > Scholar" and "Gentleman-Scholar" before the 1843 date you mentioned.
> >
> > The following instance is not hyphenated. The volume was digitized
> > in March 2011, so it may not be in the Ngram corpus.
> >
> > Year: 1674
> > Title: Remains Concerning Britain: Their Languages, Names, Surnames,
> > Allusions, Anagramms, Armories, Moneys, Impresses, . . .
> > Author: William Camden
> > Publisher: Printed for, and sold by, Charles Harper at the Flower de
> > Luce over against St. Dunstan's Church, and . . . Fletstreet. London
> > Quote Page 467
> > Digitized: Mar 3, 2011
> >
> > > https://books.google.com/books?id=OEtWAAAAYAAJ&q=%22gentleman+scholar%22#v=snippet&
> >
> > [Begin excerpt]
> > A Gentleman Scholar drawn from the University where he was well
> > liked, to the Court, for which in respect of his bashful modesty, he
> > was not fit; . . .
> > [End excerpt]
> >
> > Below is a hyphenated instance in Google Books in 1716. The book
> > was digitized in July 2007.
> >
> > Year: 1716
> > Title: Athenae Britannicae, Or, A Critical History of the Oxford and
> > Cambridge Writers and Writings . . .
> > Author: Myles Davies
> > Publisher: Printed for the Author and by his Appointment only at the
> > Corner Little Queen Street Holbourn, London
> >
> > > https://books.google.com/books?id=vycJAAAAQAAJ&q=gentleman-scholar#v=snippet&
> >
> > [Begin excerpt]
> > Whether some of the higher Clergy us'd that Gentleman-Scholar with
> > unbecoming Imperiousness, or with a Treatment not suitable to his
> > unexceptionable Parts and Deserts, and he thereupon grew
> > unredressable and irreconcilable with the whole Order, or no, is uncertain; . . .
> > [End excerpt]
> >
> > Garson
> >
> >
> > On Thu, May 7, 2015 at 1:21 PM, Shapiro, Fred <fred.shapiro at yale.edu> wrote:
> > > ---------------------- Information from the mail header -----------------------
> > > Sender: American Dialect Society <ADS-L at LISTSERV.UGA.EDU>
> > > Poster: "Shapiro, Fred" <fred.shapiro at YALE.EDU>
> > > Subject: FW: NGram vs. the OED
> > > -------------------------------------------------------------------------------
> > >
> > > Isn't NGram based on the contents of Google Books, rather than on
> > > citations from the OED? Or are you assuming that everything
> > > cited in the OED is also in Google Books?
> > >
> > > Fred Shapiro
> > >
> > > ________________________________________
> > > From: American Dialect Society [ADS-L at LISTSERV.UGA.EDU] on behalf
> > > of Joel Berson [berson at att.net]
> > > Sent: Thursday, May 07, 2015 12:38 PM
> > > To: ADS-L at LISTSERV.UGA.EDU
> > > Subject: NGram vs. the OED
> > >
> > > If the OED(2) has quotations for "gentleman-scholar" for 1586 and
> > > 1748 (I assume it will find more from later years), why does
> > > Google's NGram show no occurrences before 1843?
>
> ------------------------------------------------------------
> The American Dialect Society - http://www.americandialect.org
------------------------------------------------------------
The American Dialect Society - http://www.americandialect.org