[Corpora-List] Culturomics / new Google Books interface / COHA

Angus B. Grieve-Smith grvsmth at panix.com
Sat Dec 18 15:58:58 UTC 2010


     Yes, I think the appeal is in the quick interface: all you have to 
do is type in two words and you'll get a cute little graph.  A bunch of 
people are tweeting them up a storm, and now the developers have even 
added a "Tweet" button:

http://twitter.com/#!/search/ngram

But the corpus also has a lot of errors that can't be fixed without 
substantial cleanup.  Look at this graph of "hitler" and "stalin":

http://ngrams.googlelabs.com/graph?content=hitler%2Cstalin&year_start=1850&year_end=2000&corpus=5&smoothing=3

Now look at "Hitler" and "Stalin":

http://ngrams.googlelabs.com/graph?content=Hitler%2C+Stalin&year_start=1850&year_end=2000&corpus=5&smoothing=3
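
For anyone who wants to generate these links rather than type them, here is a minimal Python sketch that rebuilds the query string format seen in the two URLs above.  The parameter names (content, year_start, year_end, corpus, smoothing) are copied from those links; the function name build_ngram_url is my own, and I am only assuming the viewer accepts URLs of this shape because the ones above do.

```python
from urllib.parse import urlencode

# Base address copied from the links in this message.
NGRAM_BASE = "http://ngrams.googlelabs.com/graph"


def build_ngram_url(words, year_start, year_end, corpus, smoothing):
    """Build a graph URL in the same shape as the links quoted above.

    build_ngram_url is a hypothetical helper, not part of any Google API.
    """
    params = {
        "content": ",".join(words),  # comma-separated, case-sensitive terms
        "year_start": year_start,
        "year_end": year_end,
        "corpus": corpus,            # numeric corpus id, as in the links above
        "smoothing": smoothing,
    }
    # urlencode percent-encodes the comma as %2C and a space as "+".
    return NGRAM_BASE + "?" + urlencode(params)


lower = build_ngram_url(["hitler", "stalin"], 1850, 2000, corpus=5, smoothing=3)
upper = build_ngram_url(["Hitler", "Stalin"], 1850, 2000, corpus=5, smoothing=3)
print(lower)
print(upper)
```

Since the queries are case-sensitive, the two calls produce different graphs, which is exactly what makes the lower-case "hitler" hits stand out.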

     The queries are case-sensitive, which is no big deal, but what's 
with all the lower-case "hitler"s from the nineteenth century?  "Beyond 
the reach of her /hitler/ and withering sarcasm"?  "both in conjunction 
with his uncle, until the /hitler's/ retirement in 1819"?

http://www.google.com/search?q=%22hitler%22&tbs=bks:1,cdr:1,cd_min:1850,cd_max:1853&lr=lang_en

Turns out most of them are OCR errors for "bitter" or "latter."  There 
are also at least two instances where the scanned images for a 
twentieth-century book were tacked onto the end of a nineteenth-century 
book, with the nineteenth-century metadata.  I'm surprised that there 
are so many errors for the decade 1850-1860, though.  Maybe the person 
in charge of OCR for that decade was a slacker?

Finally, there's the "long s problem":

http://ngrams.googlelabs.com/graph?content=myfterious%2Cmysterious&year_start=1700&year_end=2000&corpus=0&smoothing=5
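
One possible cleanup for the long s problem, as a rough Python sketch: for a token that is not a known word, try every way of turning an "f" back into an "s" and keep the result only if exactly one candidate is in a word list.  The function names and the tiny lexicon are made up for illustration; this is not how Google processes the corpus, and a real lexicon would have to be much larger.

```python
from itertools import product


def long_s_candidates(token):
    """Return every spelling obtained by replacing some 'f's with 's'."""
    # Each 'f' may stay 'f' or become 's'; other characters are fixed.
    options = [("f", "s") if ch == "f" else (ch,) for ch in token]
    return {"".join(combo) for combo in product(*options)} - {token}


def repair_long_s(token, lexicon):
    """Undo a likely long-s misreading, or return the token unchanged."""
    if token in lexicon:
        return token  # already a real word ("of", "first"), leave it alone
    matches = sorted(c for c in long_s_candidates(token) if c in lexicon)
    # Only repair when the correction is unambiguous.
    return matches[0] if len(matches) == 1 else token


# Hypothetical toy lexicon, just for the demonstration.
lexicon = {"mysterious", "suffer", "of", "first"}

for word in ["myfterious", "fuffer", "of", "firft", "xyzzy"]:
    print(word, "->", repair_long_s(word, lexicon))
```

The lexicon check is what keeps "of" from turning into "os"; tokens with more than one plausible reading are left as they are rather than guessed at.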

-- 
				-Angus B. Grieve-Smith
				grvsmth at panix.com
