[Ads-l] FW: NGram vs. the OED

Dave Wilton dave at WILTON.NET
Fri May 8 11:08:00 UTC 2015


I consider the Ngram tool to be an excellent place to initially test a hypothesis and for "quick and dirty" analysis. I wouldn't base a published article on Ngram data.

-----Original Message-----
From: American Dialect Society [mailto:ADS-L at LISTSERV.UGA.EDU] On Behalf Of sclements at NEO.RR.COM
Sent: Thursday, May 07, 2015 9:29 PM
To: ADS-L at LISTSERV.UGA.EDU
Subject: Re: FW: NGram vs. the OED

So, the Ngram is essentially a currently flawed tool that is mostly useless.  Is that what I should take away from Ben's post?

Sam Clements

---- Ben Zimmer <bgzimmer at GMAIL.COM> wrote: 
> The current dataset for the Ngram corpus goes through 2012 (the 
> original one went through 2009). The 2012 version is described in this
> paper:
> 
> http://aclweb.org/anthology/P/P12/P12-3029.pdf
> "In this work we provide a new edition of the Google Books Ngram 
> Corpus that contains over 8 million books, or 6% of all books ever 
> published."
> 
> That's a small subset of the total number of volumes scanned and 
> digitized as part of Google Books (currently containing over 30 
> million books).
> 
> 
> On Thu, May 7, 2015 at 1:31 PM, ADSGarson O'Toole 
> <adsgarsonotoole at gmail.com> wrote:
> >
> >
> > The Ngram database was constructed using a subset of the Google 
> > Books database. Some books used for citations in the OED are not in 
> > GB (I assume). The Wikipedia article for "Google Ngram Viewer" asserts:
> >
> > [Begin excerpt]
> > It was developed by Jon Orwant and Will Brockman and released in 
> > mid-December 2010. . . .
> > Google populated the database from over 5 million books published up to 2008.
> > [End excerpt]
> >
> > It is possible that the Ngram database has not been updated after 
> > 2010. If this is true then books digitized after 2011 would be absent.
> >
> > OCR quality is sometimes poor for older works. Also, I still see 
> > metadata errors with regularity.
> >
> > Google Books does currently contain some instances of "Gentleman 
> > Scholar" and "Gentleman-Scholar" before the 1843 date you mentioned.
> >
> > The following instance is not hyphenated. The volume was digitized 
> > in March 2011, so it may not be in the Ngram corpus.
> >
> > Year: 1674
> > Title: Remains Concerning Britain: Their Languages, Names, Surnames, 
> > Allusions, Anagramms, Armories, Moneys, Impresses, . . .
> > Author: William Camden
> > Publisher: Printed for, and sold by, Charles Harper at the Flower de 
> > Luce over against St. Dunstan's Church, and . . . Fletstreet. London 
> > Quote Page 467
> > Digitized: Mar 3, 2011
> >
> > > https://books.google.com/books?id=OEtWAAAAYAAJ&q=%22gentleman+scholar%22#v=snippet&
> >
> > [Begin excerpt]
> > A Gentleman Scholar drawn from the University where he was well 
> > liked, to the Court, for which in respect of his bashful modesty, he 
> > was not fit; . . .
> > [End excerpt]
> >
> > Below is a hyphenated instance in Google Books in 1716.  The book 
> > was digitized in July 2007.
> >
> > Year: 1716
> > Title: Athenae Britannicae, Or, A Critical History of the Oxford and 
> > Cambridge Writers and Writings . . .
> > Author: Myles Davies
> > Publisher: Printed for the Author and by his Appointment only at the 
> > Corner Little Queen Street Holbourn, London
> >
> > > https://books.google.com/books?id=vycJAAAAQAAJ&q=gentleman-scholar#v=snippet&
> >
> > [Begin excerpt]
> > Whether some of the higher Clergy us'd that Gentleman-Scholar with 
> > unbecoming Imperiousness, or with a Treatment not suitable to his 
> > unexceptionable Parts and Deserts, and he thereupon grew 
> > unredressable and irreconcilable with the whole Order, or no, is uncertain; . . .
> > [End excerpt]
> >
> > Garson
> >
> >
> > On Thu, May 7, 2015 at 1:21 PM, Shapiro, Fred <fred.shapiro at yale.edu> wrote:
> > > ---------------------- Information from the mail header -----------------------
> > > Sender:       American Dialect Society <ADS-L at LISTSERV.UGA.EDU>
> > > Poster:       "Shapiro, Fred" <fred.shapiro at YALE.EDU>
> > > Subject:      FW: NGram vs. the OED
> > > ------------------------------------------------------------------
> > > -------------
> > >
> > > > Isn't NGram based on the contents of Google Books, rather than on
> > > > citations from the OED?  Or are you assuming that everything
> > > > cited in the OED is also in Google Books?
> > > >
> > > > Fred Shapiro
> > > >
> > > > ________________________________________
> > > > From: American Dialect Society [ADS-L at LISTSERV.UGA.EDU] on behalf
> > > > of Joel Berson [berson at att.net]
> > > > Sent: Thursday, May 07, 2015 12:38 PM
> > > > To: ADS-L at LISTSERV.UGA.EDU
> > > > Subject: NGram vs. the OED
> > > >
> > > > If the OED(2) has quotations for "gentleman-scholar" for 1586 and
> > > > 1748 (I assume it will find more from later years), why does
> > > > Google's NGram show no occurrences before 1843?
> 
> ------------------------------------------------------------
> The American Dialect Society - http://www.americandialect.org




