[Ads-l] FW: NGram vs. the OED

sclements at NEO.RR.COM sclements at NEO.RR.COM
Fri May 8 01:29:01 UTC 2015


So the Ngram is, at present, a flawed tool that is mostly useless. Is that what I should take away from Ben's post?

Sam Clements

---- Ben Zimmer <bgzimmer at GMAIL.COM> wrote: 
> The current dataset for the Ngram corpus goes through 2012 (the
> original one went through 2009). The 2012 version is described in this
> paper:
> 
> http://aclweb.org/anthology/P/P12/P12-3029.pdf
> "In this work we provide a new edition of the Google Books Ngram
> Corpus that contains over 8 million books, or 6% of all books ever
> published."
> 
> That's a small subset of the total number of volumes scanned and
> digitized as part of Google Books (currently containing over 30
> million books).
> 
> 
> On Thu, May 7, 2015 at 1:31 PM, ADSGarson O'Toole
> <adsgarsonotoole at gmail.com> wrote:
> >
> >
> > The Ngram database was constructed using a subset of the Google Books
> > database. Some books used for citations in the OED are not in GB (I
> > assume). The Wikipedia article for "Google Ngram Viewer" asserts:
> >
> > [Begin excerpt]
> > It was developed by Jon Orwant and Will Brockman and released in
> > mid-December 2010. . . .
> > Google populated the database from over 5 million books published up to 2008.
> > [End excerpt]
> >
> > It is possible that the Ngram database has not been updated since
> > 2010. If this is true, then books digitized from 2011 onward would be absent.
> >
> > OCR quality is sometimes poor for older works. Also, I still see
> > metadata errors with regularity.
> >
> > Google Books does currently contain some instances of "Gentleman
> > Scholar" and "Gentleman-Scholar" before the 1843 date you mentioned.
> >
> > The following instance is not hyphenated. The volume was digitized in
> > March 2011, so it may not be in the Ngram corpus.
> >
> > Year: 1674
> > Title: Remains Concerning Britain: Their Languages, Names, Surnames,
> > Allusions, Anagramms, Armories, Moneys, Impresses, . . .
> > Author: William Camden
> > Publisher: Printed for, and sold by, Charles Harper at the Flower de
> > Luce over against St. Dunstan's Church, and . . . Fletstreet. London
> > Quote Page 467
> > Digitized: Mar 3, 2011
> >
> > https://books.google.com/books?id=OEtWAAAAYAAJ&q=%22gentleman+scholar%22#v=snippet&
> >
> > [Begin excerpt]
> > A Gentleman Scholar drawn from the University where he was well liked,
> > to the Court, for which in respect of his bashful modesty, he was not
> > fit; . . .
> > [End excerpt]
> >
> > Below is a hyphenated instance in Google Books in 1716.  The book was
> > digitized in July 2007.
> >
> > Year: 1716
> > Title: Athenae Britannicae, Or, A Critical History of the Oxford and
> > Cambridge Writers and Writings . . .
> > Author: Myles Davies
> > Publisher: Printed for the Author and by his Appointment only at the
> > Corner Little Queen Street Holbourn, London
> >
> > https://books.google.com/books?id=vycJAAAAQAAJ&q=gentleman-scholar#v=snippet&
> >
> > [Begin excerpt]
> > Whether some of the higher Clergy us'd that Gentleman-Scholar with
> > unbecoming Imperiousness, or with a Treatment not suitable to his
> > unexceptionable Parts and Deserts, and he thereupon grew unredressable
> > and irreconcilable with the whole Order, or no, is uncertain; . . .
> > [End excerpt]
> >
> > Garson
> >
> >
> > On Thu, May 7, 2015 at 1:21 PM, Shapiro, Fred <fred.shapiro at yale.edu> wrote:
> > > ---------------------- Information from the mail header -----------------------
> > > Sender:       American Dialect Society <ADS-L at LISTSERV.UGA.EDU>
> > > Poster:       "Shapiro, Fred" <fred.shapiro at YALE.EDU>
> > > Subject:      FW: NGram vs. the OED
> > > -------------------------------------------------------------------------------
> > >
> > > Isn't NGram based on the contents of Google Books, rather than on citations
> > > from the OED?  Or are you assuming that everything cited in the OED is also
> > > in Google Books?
> > >
> > > Fred Shapiro
> > >
> > > ________________________________________
> > > From: American Dialect Society [ADS-L at LISTSERV.UGA.EDU] on behalf of Joel Berson [berson at att.net]
> > > Sent: Thursday, May 07, 2015 12:38 PM
> > > To: ADS-L at LISTSERV.UGA.EDU
> > > Subject: NGram vs. the OED
> > >
> > > If the OED(2) has quotations for "gentleman-scholar" for 1586 and 1748 (I
> > > assume it will find more from later years), why does Google's NGram show no
> > > occurrences before 1843?
> 
> ------------------------------------------------------------
> The American Dialect Society - http://www.americandialect.org


