[Ads-l] FW: NGram vs. the OED

Ben Zimmer bgzimmer at GMAIL.COM
Thu May 7 17:49:14 UTC 2015


The current dataset for the Ngram corpus goes through 2012 (the
original one went through 2009). The 2012 version is described in this
paper:

http://aclweb.org/anthology/P/P12/P12-3029.pdf
"In this work we provide a new edition of the Google Books Ngram
Corpus that contains over 8 million books, or 6% of all books ever
published."

That's a small subset of the total number of volumes scanned and
digitized as part of Google Books (currently containing over 30
million books).


On Thu, May 7, 2015 at 1:31 PM, ADSGarson O'Toole
<adsgarsonotoole at gmail.com> wrote:
>
>
> The Ngram database was constructed using a subset of the Google Books
> database. Some books used for citations in the OED are not in GB (I
> assume). The Wikipedia article for "Google Ngram Viewer" asserts:
>
> [Begin excerpt]
> It was developed by Jon Orwant and Will Brockman and released in
> mid-December 2010. . . .
> Google populated the database from over 5 million books published up to 2008.
> [End excerpt]
>
> It is possible that the Ngram database has not been updated after
> 2010. If this is true then books digitized after 2011 would be absent.
>
> OCR quality is sometimes poor for older works. Also, I still see
> metadata errors with regularity.
>
> Google Books does currently contain some instances of "Gentleman
> Scholar" and "Gentleman-Scholar" before the 1843 date you mentioned.
>
> The following instance is not hyphenated. The volume was digitized in
> March 2011, so it may not be in the Ngram corpus.
>
> Year: 1674
> Title: Remains Concerning Britain: Their Languages, Names, Surnames,
> Allusions, Anagramms, Armories, Moneys, Impresses, . . .
> Author: William Camden
> Publisher: Printed for, and sold by, Charles Harper at the Flower de
> Luce over against St. Dunstan's Church, and . . . Fletstreet. London
> Quote Page 467
> Digitized: Mar 3, 2011
>
> https://books.google.com/books?id=OEtWAAAAYAAJ&q=%22gentleman+scholar%22#v=snippet&
>
> [Begin excerpt]
> A Gentleman Scholar drawn from the University where he was well liked,
> to the Court, for which in respect of his bashful modesty, he was not
> fit; . . .
> [End excerpt]
>
> Below is a hyphenated instance in Google Books in 1716.  The book was
> digitized in July 2007.
>
> Year: 1716
> Title: Athenae Britannicae, Or, A Critical History of the Oxford and
> Cambridge Writers and Writings . . .
> Author: Myles Davies
> Publisher: Printed for the Author and by his Appointment only at the
> Corner Little Queen Street Holbourn, London
>
> https://books.google.com/books?id=vycJAAAAQAAJ&q=gentleman-scholar#v=snippet&
>
> [Begin excerpt]
> Whether some of the higher Clergy us'd that Gentleman-Scholar with
> unbecoming Imperiousness, or with a Treatment not suitable to his
> unexceptionable Parts and Deserts, and he thereupon grew unredressable
> and irreconcilable with the whole Order, or no, is uncertain; . . .
> [End excerpt]
>
> Garson
>
>
> On Thu, May 7, 2015 at 1:21 PM, Shapiro, Fred <fred.shapiro at yale.edu> wrote:
> > ---------------------- Information from the mail header -----------------------
> > Sender:       American Dialect Society <ADS-L at LISTSERV.UGA.EDU>
> > Poster:       "Shapiro, Fred" <fred.shapiro at YALE.EDU>
> > Subject:      FW: NGram vs. the OED
> > -------------------------------------------------------------------------------
> >
> >
> > Isn't NGram based on the contents of Google Books, rather than on citations
> > from the OED?  Or are you assuming that everything cited in the OED is also
> > in Google Books?
> >
> > Fred Shapiro
> >
> > ________________________________________
> > From: American Dialect Society [ADS-L at LISTSERV.UGA.EDU] on behalf of Joel Berson
> > [berson at att.net]
> > Sent: Thursday, May 07, 2015 12:38 PM
> > To: ADS-L at LISTSERV.UGA.EDU
> > Subject: NGram vs. the OED
> >
> > If the OED(2) has quotations for "gentleman-scholar" for 1586 and 1748 (I
> > assume it will find more from later years), why does Google's NGram show no
> > occurrences before 1843?

------------------------------------------------------------
The American Dialect Society - http://www.americandialect.org


