<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
  <head>
    <meta content="text/html; charset=ISO-8859-1"
      http-equiv="Content-Type">
    <title></title>
  </head>
  <body text="#000000" bgcolor="#ffffff">
        Yes, I think the appeal is in the quick interface: all you have
    to do is type in two words and you'll get a cute little graph.  A
    bunch of people are tweeting them up a storm, and now the developers
    have even added a "Tweet" button:<br>
    <br>
    <a class="moz-txt-link-freetext" href="http://twitter.com/#!/search/ngram">http://twitter.com/#!/search/ngram</a><br>
    <br>
    But the corpus also has plenty of errors that can't be fixed
    without a lot of cleanup.  Look at this graph of "hitler" and
    "stalin":<br>
    <br>
<a class="moz-txt-link-freetext" href="http://ngrams.googlelabs.com/graph?content=hitler%2Cstalin&year_start=1850&year_end=2000&corpus=5&smoothing=3">http://ngrams.googlelabs.com/graph?content=hitler%2Cstalin&year_start=1850&year_end=2000&corpus=5&smoothing=3</a><br>
    <br>
    Now look at "Hitler" and "Stalin":<br>
    <br>
<a class="moz-txt-link-freetext" href="http://ngrams.googlelabs.com/graph?content=Hitler%2C+Stalin&year_start=1850&year_end=2000&corpus=5&smoothing=3">http://ngrams.googlelabs.com/graph?content=Hitler%2C+Stalin&year_start=1850&year_end=2000&corpus=5&smoothing=3</a><br>
    <br>
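    The two links above differ only in their <tt>content</tt>
    parameter.  Here is a minimal sketch of building such links so the
    two spellings can be compared side by side.  The base URL and
    parameter names are copied from the links in this message; the
    helper name <tt>ngram_url</tt> is made up for illustration, and
    nothing here is an official API:<br>
    <br>
<pre>
```python
from urllib.parse import urlencode

# Base URL and parameter names are copied from the links in this message.
# The helper name ngram_url is hypothetical, not part of any Google API.
BASE = "http://ngrams.googlelabs.com/graph"

def ngram_url(words, year_start=1850, year_end=2000, corpus=5, smoothing=3):
    """Build a graph URL for a list of words, keeping their case as given."""
    params = [
        ("content", ",".join(words)),
        ("year_start", year_start),
        ("year_end", year_end),
        ("corpus", corpus),
        ("smoothing", smoothing),
    ]
    return BASE + "?" + urlencode(params)

# Same words, different case: the query strings differ only in "content".
lower = ngram_url(["hitler", "stalin"])
upper = ngram_url(["Hitler", "Stalin"])
print(lower)
print(upper)
```
</pre>
    <br>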
        The queries are case-sensitive, which is no big deal, but what's
    with all the lower-case "hitler"s from the nineteenth century? 
    "Beyond the reach of her <i>hitler</i> and withering sarcasm"? 
    "both in conjunction with his uncle, until the <i>hitler's</i>
    retirement in 1819"?<br>
    <br>
<a class="moz-txt-link-freetext" href="http://www.google.com/search?q=%22hitler%22&tbs=bks:1,cdr:1,cd_min:1850,cd_max:1853&lr=lang_en">http://www.google.com/search?q=%22hitler%22&tbs=bks:1,cdr:1,cd_min:1850,cd_max:1853&lr=lang_en</a><br>
    <br>
    Turns out most of them are OCR errors for "bitter" or "latter." 
    There are also at least two instances where the scanned images for a
    twentieth-century book were tacked onto the end of a
    nineteenth-century book, with the nineteenth-century metadata.  I'm
    surprised that there are so many errors for the decade 1850-1860,
    though.  Maybe the person in charge of OCR for that decade was a
    slacker?<br>
    <br>
    Finally, there's the "long s problem":<br>
    <br>
<a class="moz-txt-link-freetext" href="http://ngrams.googlelabs.com/graph?content=myfterious%2Cmysterious&year_start=1700&year_end=2000&corpus=0&smoothing=5">http://ngrams.googlelabs.com/graph?content=myfterious%2Cmysterious&year_start=1700&year_end=2000&corpus=0&smoothing=5</a><br>
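    <br>
    One rough way to fold spellings like "myfterious" back into
    "mysterious" is to try turning each "f" into "s" and keep the
    result only if it is a known word.  This is just a sketch: the
    function names and the tiny word list are made up for illustration,
    and a real cleanup would need a full dictionary and some way to
    deal with words that are valid either way:<br>
    <br>
<pre>
```python
from itertools import product

# Tiny stand-in word list (an assumption for this sketch only).
KNOWN_WORDS = {"mysterious", "first", "suffer", "false"}

def long_s_candidates(token):
    """Yield every spelling obtained by turning some 'f' into 's'."""
    slots = [("f", "s") if ch == "f" else (ch,) for ch in token]
    for combo in product(*slots):
        yield "".join(combo)

def fix_long_s(token, known=KNOWN_WORDS):
    """Return a known word reachable by f-to-s swaps, else the token itself."""
    if token in known:
        return token
    matches = sorted(c for c in long_s_candidates(token) if c in known)
    return matches[0] if matches else token

print(fix_long_s("myfterious"))
```
</pre>
    Words that are already in the list (such as "false") are left
    alone, and a token with no known match comes back unchanged.<br>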
    <pre class="moz-signature" cols="72">-- 
                                -Angus B. Grieve-Smith
                                <a class="moz-txt-link-abbreviated" href="mailto:grvsmth@panix.com">grvsmth@panix.com</a>
</pre>
  </body>
</html>