<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<html>

  <head>

    <meta content="text/html; charset=ISO-8859-1"

      http-equiv="Content-Type">

    <title></title>

  </head>

  <body text="#000000" bgcolor="#ffffff">

        Yes, I think the appeal is in the quick interface: all you have

    to do is type in two words and you'll get a cute little graph.  A

    bunch of people are tweeting them up a storm, and now the developers

    have even added a "Tweet" button:<br>

    <br>

    <a class="moz-txt-link-freetext" href="http://twitter.com/#!/search/ngram">http://twitter.com/#!/search/ngram</a><br>

    <br>

    But the corpus also has plenty of slips that can't be rectified

    without serious cleanup.  Look at this graph of "hitler" and

    "stalin":<br>

    <br>

<a class="moz-txt-link-freetext" href="http://ngrams.googlelabs.com/graph?content=hitler%2Cstalin&year_start=1850&year_end=2000&corpus=5&smoothing=3">http://ngrams.googlelabs.com/graph?content=hitler%2Cstalin&year_start=1850&year_end=2000&corpus=5&smoothing=3</a><br>

    <br>

    Now look at "Hitler" and "Stalin":<br>

    <br>

<a class="moz-txt-link-freetext" href="http://ngrams.googlelabs.com/graph?content=Hitler%2C+Stalin&year_start=1850&year_end=2000&corpus=5&smoothing=3">http://ngrams.googlelabs.com/graph?content=Hitler%2C+Stalin&year_start=1850&year_end=2000&corpus=5&smoothing=3</a><br>

    <br>

        The queries are case-sensitive, which is no big deal, but what's

    with all the lower-case "hitler"s from the nineteenth century? 

    "Beyond the reach of her <em>hitler</em> and withering sarcasm"? 

    "both in conjunction with his uncle, until the <em>hitler's</em>

    retirement in 1819"?<br>

    <br>

<a class="moz-txt-link-freetext" href="http://www.google.com/search?q=%22hitler%22&tbs=bks:1,cdr:1,cd_min:1850,cd_max:1853&lr=lang_en">http://www.google.com/search?q=%22hitler%22&tbs=bks:1,cdr:1,cd_min:1850,cd_max:1853&lr=lang_en</a><br>

    <br>

    Turns out most of them are OCR errors for "bitter" or "latter." 

    There are also at least two instances where the scanned images for a

    twentieth-century book were tacked onto the end of a

    nineteenth-century book, with the nineteenth-century metadata.  I'm

    surprised that there are so many errors for the decade 1850-1860,

    though.  Maybe the person in charge of OCR for that decade was a

    slacker?<br>

    <br>

    Finally, there's the "long s problem": older books often print the long s ("&#383;"), which the OCR tends to read as an "f":<br>

    <br>

<a class="moz-txt-link-freetext" href="http://ngrams.googlelabs.com/graph?content=myfterious%2Cmysterious&year_start=1700&year_end=2000&corpus=0&smoothing=5">http://ngrams.googlelabs.com/graph?content=myfterious%2Cmysterious&year_start=1700&year_end=2000&corpus=0&smoothing=5</a><br>
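    <br>
    For what it's worth, here is a rough Python sketch of how one might undo that
    kind of misreading: try turning each "f" in a token into an "s" and keep the
    result only if exactly one candidate is a known word.  The function names and
    the tiny word list are made up for illustration; I have no idea what cleanup,
    if any, is actually applied to the corpus.<br>
    <pre>
```python
from itertools import product


def long_s_candidates(token):
    """Yield every spelling made by turning some of the f's in token into s."""
    options = [("f", "s") if ch == "f" else (ch,) for ch in token]
    for combo in product(*options):
        yield "".join(combo)


def restore_long_s(token, vocabulary):
    """Return the modern spelling if exactly one candidate is a known word."""
    if token in vocabulary:
        return token  # already a real word, so leave it alone
    matches = [c for c in long_s_candidates(token) if c in vocabulary]
    # Give up (return the token unchanged) when there is no match or a tie.
    return matches[0] if len(matches) == 1 else token


# Hypothetical toy vocabulary; a real one would be a full dictionary.
VOCABULARY = {"mysterious", "success", "fame", "same"}

print(restore_long_s("myfterious", VOCABULARY))  # prints "mysterious"
```
</pre>
    Even with a full dictionary this would still get stuck on pairs like "fame"
    and "same", where both readings are real words, so the sketch simply leaves
    those alone.<br>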

    <pre class="moz-signature" cols="72">-- 

                                -Angus B. Grieve-Smith

                                <a class="moz-txt-link-abbreviated" href="mailto:grvsmth@panix.com">grvsmth@panix.com</a>

</pre>

  </body>

</html>