Geoffrey Nunberg nunberg at CSLI.STANFORD.EDU
Tue Nov 8 05:54:22 UTC 2005

I've found that, as Arnold suggests, you can use a collection of
high-frequency, high-diffusion items as a proxy for database size --
items like "among," "fix," "book," "behalf" and so forth. One problem
with this is that in the most widely used interface, in addition to
having items like "the" on its stop list, Nexis won't return hit
counts over 1,000. In that case you can either take a time-slice (say
one month a year), or do a conjunctive search (e.g., on articles
containing "fix," "buy," and "friend"), or choose some
lower-frequency items like "revoke" and "estimation." While all these
methods are approximate, I've found, as Arnold has, that they don't
vary too widely from counts done on high-frequency functors like
"the." And given the noise introduced by factors like article
duplication, none of these estimates is going to be exact in any case.

Geoff Nunberg
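
A minimal sketch of the proxy-word normalization discussed in this thread, in case it helps anyone who wants to run the numbers. Everything here is an assumption made for illustration: the names `usable_words` and `relative_rates`, the layout of the `periods` table, the idea that a capped search shows up as a count equal to the cap, and above all the hit counts, which are invented and do not come from Nexis or Factiva. It drops any proxy word that reaches the cap in any period, so every period is measured against the same basket.

```python
HIT_CAP = 1000  # assumed ceiling; a count at or above it is treated as uninformative

# Invented counts, for illustration only: one time-slice per year.
periods = {
    "1999": {"target": 12, "proxy": {"revoke": 400, "estimation": 200, "fix": 1000}},
    "2004": {"target": 45, "proxy": {"revoke": 500, "estimation": 250, "fix": 1000}},
}

def usable_words(periods, cap=HIT_CAP):
    """Return the proxy words whose count stays below the cap in every period."""
    shared = set.intersection(*(set(p["proxy"]) for p in periods.values()))
    return sorted(
        w for w in shared
        if all(p["proxy"][w] < cap for p in periods.values())
    )

def relative_rates(periods, cap=HIT_CAP, scale=1000):
    """Target hits per `scale` proxy hits, using the same proxy basket for each period."""
    words = usable_words(periods, cap)
    if not words:
        raise ValueError("every proxy word hit the cap; use rarer words or a smaller time-slice")
    rates = {}
    for label, p in periods.items():
        size = sum(p["proxy"][w] for w in words)  # stand-in for database size
        rates[label] = scale * p["target"] / size
    return rates

rates = relative_rates(periods)
```

With these made-up figures "fix" is discarded because it sits at the cap, and the remaining basket gives 20.0 for 1999 and 60.0 for 2004. The `scale` constant only moves the numbers into a readable range; what matters is the ratio between periods.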

>Arnold, you are, of course, right. In this case, though, couldn't all
>uses of the word "gravitas" count towards the total count, whether
>they are multiple quotations of a single usage of the word, multiple
>iterations of the same wire stories in different publications, or
>discussions of the word itself? The reporter's premise, as he put it
>to me on the phone, was that the word *seemed* to be more common. The
>Factiva numbers support that, since all occurrences of the word he
>saw or heard would add to his impression. His growing feeling that
>the word was more common would not necessarily discern between
>repeats, redundancies, or discussions of the word. His mistake was
>using those numbers to bolster his argument, which turned a statement
>of impression/feeling/opinion into a statement of fact.
>We should be able to account for the count problem by matching it
>with another common word such as "the," although on LexisNexis
>Academic this is impossible because of its stop words (meaning "the"
>is automatically skipped in searches). Factiva *does* appear to
>permit searching for "the." I think we can account in some small way
>for discussions of the word gravitas by eliminating those hits that
>include such phrases as "meaning of (the word) gravitas," "definition
>of (the word) gravitas," "origin of the word/term gravitas," etc.
>Without spending weeks on it and hiring a staff, I see no easy way to
>account for multiple quotations of a single usage of the word
>("Cheney said, 'Bush has the gravitas necessary to be president.'")
>or multiple iterations of the same wire stories in different publications.
>I don't have the time to redo the searches using the strategies you
>suggest, but my bet is that we'd still see the same spikes in usage
>at the time of the presidential elections, and an overall trend for
>more usage, especially when compared to 1999 and earlier. Beer at the
>ADS meeting in Albuquerque for anyone who does indeed run the numbers.
>On Nov 7, 2005, at 15:19, Arnold M. Zwicky wrote:
>>On Nov 7, 2005, at 11:12 AM, Grant Barrett wrote:
>>>... The proper way to do such data-gathering would have been to
>>>search a
>>>set group of newspapers over that same period of time.
>>even that isn't enough, unless the number of pages (or words)
>>searched remains constant over the search period.  one way to fix
>>that is to normalize by dividing the raw figures for a period by some
>>measure of the number of words searched in that period.  in our work
>>at stanford on quotative "all" in google groups, we (well tom wasow)
>>used a search on the word "the" for this purpose, and then multiplied
>>by a constant to get numbers in some reasonable range (ultimately,
>>estimates of quotative "all" per 100,000 pages).
>>at my suggestion, tom then tried a number of other very common words
>>for normalization, and got results almost scarily close to the ones
>>for "the".  so we think "the" is a pretty good normalizer.
>>one of the things we had to do in counting quotative "all"
>>occurrences was to remove examples in discussions *about* the word,
>>which become fairly frequent when the usage does.  i'd expect a
>>similar problem with "gravitas" citations.  "gravitas" has another
>>problem that is unlikely to be significant for quotative "all":
>>quotations of previous uses.  these two effects would contribute to
>>an increase in "gravitas" hits over time (at least for a while), even
>>if people were not producing more primary occurrences.
>>sampling the data could yield an estimate of these effects for
>>"gravitas", if hand-searching turns out to be onerous.
>>i suspect that the size of these effects isn't constant across vogue
>>words and innovative usages, so they'd have to be estimated for each

