[Corpora-List] Wonky ngrams

Alon Lischinsky alischinsky at gmail.com
Fri Jan 4 12:58:19 UTC 2013


On 04/01/13 12:04, Brett Reynolds wrote:

> Can anyone explain why "in spite of" would have a higher frequency than
> "in spite" in the following graph from Google ngrams?
> http://goo.gl/u7J3F

From http://books.google.com/ngrams/info:

“What the y-axis shows is this: of all the bigrams contained in our
sample of books written in English and published in the United States,
what percentage of them are [the bigram sought]?”

In other words: each frequency is calculated over the total number of
N-grams of the same length. Since the denominator differs between the
two calculations (a text generally contains slightly fewer trigrams
than bigrams), a bigram and a trigram that are expected to have almost
identical counts over the corpus (as in your example) can show slight
differences in calculated frequency. (I suspect rounding errors play a
role as well.)
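
To make the denominator effect concrete, here is a minimal sketch on a
made-up three-text corpus. The texts, the function names and the choice
not to let N-grams span text boundaries are all my own assumptions for
illustration; this is not how Google actually builds its counts.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-token tuples in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def relative_frequency(texts, phrase):
    """Return (count, total, count / total) for a phrase, where total is
    the number of N-grams of the same length as the phrase."""
    target = tuple(phrase.split())
    n = len(target)
    counts = Counter()
    for text in texts:
        # Assumption: N-grams are counted within each text only.
        counts.update(ngrams(text.split(), n))
    total = sum(counts.values())
    return counts[target], total, counts[target] / total


# Toy corpus (invented): "in spite" is always followed by "of".
texts = [
    "she went out in spite of the rain",
    "in spite of everything he stayed",
    "the rain stopped",
]

bigram = relative_frequency(texts, "in spite")
trigram = relative_frequency(texts, "in spite of")

print("in spite:    %d / %d = %.4f" % bigram)
print("in spite of: %d / %d = %.4f" % trigram)
```

Both phrases occur the same number of times, but the trigram count is
divided by a smaller total, so its relative frequency comes out higher.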

Cheers,

A.
