[Ads-l] FW: NGram vs. the OED

Fri May 8 04:17:09 UTC 2015

It's only "mostly useless" if you expect the Ngram Corpus to encompass
all of Google Books, which was never the idea. It's supposed to
present a balanced and representative linguistic sample (in the
tradition of general corpora going back to the Brown Corpus of the
'60s, but orders of magnitude larger than any of them).

Selecting a mega-sample like this also means that the dataset used for
the Ngram Corpus can largely avoid lower-quality OCR in the scanning
project, as well as metadata problems like misdating. The greatest
focus has been on ensuring quality in the period from 1800 to 2000.
More info can be found in the culturomics FAQ:

http://www.culturomics.org/Resources/faq

I'm admittedly an Ngram fan, though I recognize the tool's flaws. Here
are relevant pieces I've written for The Atlantic:

http://www.theatlantic.com/technology/archive/2012/10/bigger-better-google-ngrams-brace-yourself-for-the-power-of-grammar/263487/
http://www.theatlantic.com/technology/archive/2013/10/googles-ngram-viewer-goes-wild/280601/

On Thu, May 7, 2015 at 9:29 PM,  <sclements at neo.rr.com> wrote:
> So, the Ngram is essentially a currently flawed tool that is mostly useless.
> Is that what I should take away from Ben's post?
>
> ---- Ben Zimmer <bgzimmer at GMAIL.COM> wrote:
>> The current dataset for the Ngram corpus goes through 2012 (the
>> original one went through 2009). The 2012 version is described in this
>> paper:
>>
>> http://aclweb.org/anthology/P/P12/P12-3029.pdf
>> "In this work we provide a new edition of the Google Books Ngram
>> Corpus that contains over 8 million books, or 6% of all books ever
>> published."
>>
>> That's a small subset of the total number of volumes scanned and
>> digitized as part of Google Books (currently containing over 30
>> million books).
>>
>>
>> On Thu, May 7, 2015 at 1:31 PM, ADSGarson O'Toole
>> <adsgarsonotoole at gmail.com> wrote:
>> >
>> >
>> > The Ngram database was constructed using a subset of the Google Books
>> > database. Some books used for citations in the OED are not in GB (I
>> > assume). The Wikipedia article for "Google Ngram Viewer" asserts:
>> >
>> > [Begin excerpt]
>> > It was developed by Jon Orwant and Will Brockman and released in
>> > mid-December 2010. . . .
>> > Google populated the database from over 5 million books published up to 2008.
>> > [End excerpt]
>> >
>> > It is possible that the Ngram database has not been updated after
>> > 2010. If this is true then books digitized after 2011 would be absent.
>> >
>> > OCR quality is sometimes poor for older works. Also, I still see
>> > metadata errors with regularity.
>> >
>> > Google Books does currently contain some instances of "Gentleman
>> > Scholar" and "Gentleman-Scholar" before the 1843 date you mentioned.
>> >
>> > The following instance is not hyphenated. The volume was digitized in
>> > March 2011, so it may not be in the Ngram corpus.
>> >
>> > Year: 1674
>> > Title: Remains Concerning Britain: Their Languages, Names, Surnames,
>> > Allusions, Anagramms, Armories, Moneys, Impresses, . . .
>> > Author: William Camden
>> > Publisher: Printed for, and sold by, Charles Harper at the Flower de
>> > Luce over against St. Dunstan's Church, and . . . Fletstreet. London
>> > Quote Page 467
>> > Digitized: Mar 3, 2011
>> >
>> > https://books.google.com/books?id=OEtWAAAAYAAJ&q=%22gentleman+scholar%22#v=snippet&
>> >
>> > [Begin excerpt]
>> > A Gentleman Scholar drawn from the University where he was well liked,
>> > to the Court, for which in respect of his bashful modesty, he was not
>> > fit; . . .
>> > [End excerpt]
>> >
>> > Below is a hyphenated instance in Google Books in 1716.  The book was
>> > digitized in July 2007.
>> >
>> > Year: 1716
>> > Title: Athenae Britannicae, Or, A Critical History of the Oxford and
>> > Cambridge Writers and Writings . . .
>> > Author: Myles Davies
>> > Publisher: Printed for the Author and by his Appointment only at the
>> > Corner Little Queen Street Holbourn, London
>> >
>> > https://books.google.com/books?id=vycJAAAAQAAJ&q=gentleman-scholar#v=snippet&
>> >
>> > [Begin excerpt]
>> > Whether some of the higher Clergy us'd that Gentleman-Scholar with
>> > unbecoming Imperiousness, or with a Treatment not suitable to his
>> > unexceptionable Parts and Deserts, and he thereupon grew unredressable
>> > and irreconcilable with the whole Order, or no, is uncertain; . . .
>> > [End excerpt]
>> >
>> > Garson
>> >
>> >
>> > On Thu, May 7, 2015 at 1:21 PM, Shapiro, Fred <fred.shapiro at yale.edu> wrote:
>> > > ---------------------- Information from the mail header -----------------------
>> > > Sender:       American Dialect Society <ADS-L at LISTSERV.UGA.EDU>
>> > > Poster:       "Shapiro, Fred" <fred.shapiro at YALE.EDU>
>> > > Subject:      FW: NGram vs. the OED
>> > > -------------------------------------------------------------------------------
>> > >
>> > >
>> > > Isn't NGram based on the contents of Google Books, rather than on citations
>> > > from the OED?  Or are you assuming that everything cited in the OED is also
>> > > in Google Books?
>> > >
>> > > Fred Shapiro
>> > >
>> > > ________________________________________
>> > > From: American Dialect Society [ADS-L at LISTSERV.UGA.EDU] on behalf of Joel
>> > > Berson [berson at att.net]
>> > > Sent: Thursday, May 07, 2015 12:38 PM
>> > > To: ADS-L at LISTSERV.UGA.EDU
>> > > Subject: NGram vs. the OED
>> > >
>> > > If the OED(2) has quotations for "gentleman-scholar" for 1586 and 1748 (I
>> > > assume it will find more from later years), why does Google's NGram show no
>> > > occurrences before 1843?

------------------------------------------------------------
The American Dialect Society - http://www.americandialect.org