[Corpora-List] Wonky ngrams
Alon Lischinsky
alischinsky at gmail.com
Fri Jan 4 12:58:19 UTC 2013
On 04/01/13 12:04, Brett Reynolds wrote:
> Can anyone explain why "in spite of" would have a higher frequency than
> "in spite" in the following graph from Google ngrams?
> http://goo.gl/u7J3F
From http://books.google.com/ngrams/info:
“What the y-axis shows is this: of all the bigrams contained in our
sample of books written in English and published in the United States,
what percentage of them are [the bigram sought]?”
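In other words: each frequency is calculated over the total number
of N-grams of the same length. Since the denominator differs between
the two calculations, a bigram and a trigram that are expected to have
almost identical distributions over the corpus (as in your example)
can show slight differences in calculated frequency. A text of a given
length contains slightly fewer trigrams than bigrams, so the same raw
count divided by the smaller total gives the trigram a marginally
higher percentage. (I suspect rounding errors play a role as well.)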
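
To make the arithmetic concrete, here is a minimal Python sketch on a made-up three-sentence corpus. It is only an illustration: the toy sentences, the function names, and the choice to count N-grams within sentences (never across sentence boundaries) are my assumptions, not a description of how Google actually builds its counts.

```python
from collections import Counter


def ngram_counts(sentences, n):
    """Count n-grams inside each sentence (assumption: none span a boundary)."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def relative_frequency(sentences, phrase):
    """Share of all n-grams of the phrase's own length that equal the phrase."""
    target = tuple(phrase.lower().split())
    counts = ngram_counts(sentences, len(target))
    total = sum(counts.values())
    return counts[target] / total


# Hypothetical toy corpus, invented purely for illustration.
toy_corpus = [
    "she went out in spite of the rain",
    "in spite of everything he stayed",
    "the rain stopped",
]

bigram_share = relative_frequency(toy_corpus, "in spite")
trigram_share = relative_frequency(toy_corpus, "in spite of")

print(f"in spite:    {bigram_share:.4f}")
print(f"in spite of: {trigram_share:.4f}")
```

Both phrases occur the same number of times here, but the corpus holds fewer trigrams than bigrams, so the trigram ends up with the larger share.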
Cheers,
A.