gravitas numbers

Mon Nov 7 20:19:40 UTC 2005

On Nov 7, 2005, at 11:12 AM, Grant Barrett wrote:

> ... The proper way to do such data-gathering would have been to
> search a set group of newspapers over that same period of time.

even that isn't enough, unless the number of pages (or words)
searched remains constant over the search period.  one way to fix
that is to normalize by dividing the raw figures for a period by some
measure of the number of words searched in that period.  in our work
at stanford on quotative "all" in google groups, we (well, tom wasow)
used a search on the word "the" for this purpose, and then multiplied
by a constant to get numbers in some reasonable range (ultimately,
estimates of quotative "all" per 100,000 pages).
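
a minimal sketch of that normalization, to make the arithmetic concrete.  the function name, the period labels, and the counts are all made up for illustration; this is not the script used in the stanford work, and the scale constant is just a parameter.

```python
def normalized_rate(target_hits, baseline_hits, scale=100_000):
    """Return target hits per `scale` hits of the baseline word.

    `baseline_hits` stands in for corpus size in the period, e.g. the
    number of hits for "the". All names here are hypothetical.
    """
    if baseline_hits <= 0:
        raise ValueError("baseline_hits must be positive")
    return scale * target_hits / baseline_hits


# invented counts: (period, hits for the target, hits for "the")
counts = [
    ("period 1", 50, 1_000_000),
    ("period 2", 100, 4_000_000),
]

rates = {period: normalized_rate(t, b) for period, t, b in counts}
for period, rate in rates.items():
    print(period, rate)
```

with these invented figures the raw hits double from one period to the next while the normalized rate halves, because the amount of text searched has quadrupled.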

at my suggestion, tom then tried a number of other very common words
for normalization, and got results almost scarily close to the ones
for "the".  so we think "the" is a pretty good normalizer.

one of the things we had to do in counting quotative "all"
occurrences was to remove examples in discussions *about* the word,
which become fairly frequent when the usage does.  i'd expect a
similar problem with "gravitas" citations.  "gravitas" has another
problem that is unlikely to be significant for quotative "all":
quotations of previous uses.  these two effects would contribute to
an increase in "gravitas" hits over time (at least for a while), even
if people were not producing more primary occurrences.

sampling the data could yield an estimate of these effects for
"gravitas", if hand-searching turns out to be onerous.

i suspect that the size of these effects isn't constant across vogue
words and innovative usages, so they'd have to be estimated for each
case.
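
a rough sketch of the sampling idea: draw a random sample of hits, code each one by hand as a primary occurrence or not, and scale the raw count by the estimated share.  everything named here (the function, the labels "primary", "mention", "quote", the toy data) is an assumption for illustration; the `is_primary` callback stands in for a human judgment, and the standard error is the simple binomial one, which ignores any finite-population correction.

```python
import math
import random


def estimate_primary_share(hits, is_primary, sample_size, seed=0):
    """Estimate the share of hits that are primary occurrences.

    `hits` is a list of retrieved examples; `is_primary` is a callable
    standing in for hand-coding. Returns (share, standard_error).
    """
    rng = random.Random(seed)
    sample = rng.sample(hits, min(sample_size, len(hits)))
    n = len(sample)
    if n == 0:
        raise ValueError("no hits to sample")
    share = sum(1 for h in sample if is_primary(h)) / n
    std_err = math.sqrt(share * (1 - share) / n)
    return share, std_err


# toy data: each hit is already labeled, as if coded by hand
toy_hits = ["primary"] * 6 + ["mention"] * 2 + ["quote"] * 2

share, std_err = estimate_primary_share(
    toy_hits, lambda h: h == "primary", sample_size=10
)

# corrected count: raw hits scaled by the estimated primary share
corrected_hits = len(toy_hits) * share
```

since the share is likely to differ from one vogue word to another, and from one period to another, the sample would have to be drawn separately for each word and each period.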

arnold