[Corpora-List] Culturomics / new Google Books interface / COHA
Angus B. Grieve-Smith
grvsmth at panix.com
Sat Dec 18 15:58:58 UTC 2010
Yes, I think the appeal is in the quick interface: all you have to
do is type in two words and you'll get a cute little graph. A bunch of
people are tweeting them up a storm, and now the developers have even
added a "Tweet" button:
http://twitter.com/#!/search/ngram
But the corpus also has a lot of slips that can't be rectified without a
lot of cleanup. Look at this graph of "hitler" and "stalin":
http://ngrams.googlelabs.com/graph?content=hitler%2Cstalin&year_start=1850&year_end=2000&corpus=5&smoothing=3
Now look at "Hitler" and "Stalin":
http://ngrams.googlelabs.com/graph?content=Hitler%2C+Stalin&year_start=1850&year_end=2000&corpus=5&smoothing=3
The queries are case-sensitive, which is no big deal, but what's
with all the lower-case "hitler"s from the nineteenth century? "Beyond
the reach of her /hitler/ and withering sarcasm"? "both in conjunction
with his uncle, until the /hitler's/ retirement in 1819"?
http://www.google.com/search?q=%22hitler%22&tbs=bks:1,cdr:1,cd_min:1850,cd_max:1853&lr=lang_en
Turns out most of them are OCR errors for "bitter" or "latter." There
are also at least two instances where the scanned images for a
twentieth-century book were tacked onto the end of a nineteenth-century
book, with the nineteenth-century metadata. I'm surprised that there
are so many errors for the decade 1850-1860, though. Maybe the person
in charge of OCR for that decade was a slacker?
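For anyone who wants to repeat the case comparison with other word pairs, here is a minimal Python sketch that rebuilds the two graph links above. The parameter names and values (content, year_start, year_end, corpus, smoothing) are simply copied from those links; the function name ngram_url is made up for illustration, and nothing here is a documented API.

```python
from urllib.parse import urlencode

# Base address copied from the graph links in this message.
BASE = "http://ngrams.googlelabs.com/graph"


def ngram_url(content, year_start=1850, year_end=2000, corpus=5, smoothing=3):
    """Build a graph link with the same query parameters as the links above.

    `content` is the comma-separated list of words; it is passed through
    unchanged, so "hitler,stalin" and "Hitler, Stalin" give different links.
    """
    params = {
        "content": content,
        "year_start": year_start,
        "year_end": year_end,
        "corpus": corpus,
        "smoothing": smoothing,
    }
    # urlencode percent-escapes the comma and turns spaces into "+".
    return BASE + "?" + urlencode(params)


lower = ngram_url("hitler,stalin")
upper = ngram_url("Hitler, Stalin")
print(lower)
print(upper)
```

Since the queries are case-sensitive, the two links have to be requested separately; the sketch only builds the links and does not fetch anything.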
Finally, there's the "long s problem":
http://ngrams.googlelabs.com/graph?content=myfterious%2Cmysterious&year_start=1700&year_end=2000&corpus=0&smoothing=5
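A rough Python sketch of how one might generate the likely "long s" misreading of a modern spelling, so that both forms can be queried together as in the link above. It assumes a deliberately simplified rule (every lower-case "s" except a word-final one was printed as a long s and read by the OCR as "f"); real printing conventions had more exceptions, and the function name is hypothetical.

```python
def long_s_misreading(word):
    """Return the spelling an OCR engine might produce for `word`
    if every non-final lower-case "s" were a long s misread as "f".

    Simplifying assumption: the word-final letter is never a long s,
    and capital "S" is left alone.
    """
    if not word:
        return word
    body, last = word[:-1], word[-1]
    return body.replace("s", "f") + last


def query_pair(word):
    """Comma-separated pair (misreading, modern spelling) for one query."""
    return long_s_misreading(word) + "," + word


print(query_pair("mysterious"))
```

The pair printed for "mysterious" is the same content string used in the link above, before URL escaping.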
--
-Angus B. Grieve-Smith
grvsmth at panix.com