[Corpora-List] Google Ngram count

Miles Osborne miles at inf.ed.ac.uk
Mon Sep 1 13:14:21 UTC 2008


I can't comment on the status of counts derived from Google's search
page and there is no published statement on the relationship (if any)
between the Ngram counts and any other counts.

That aside, as with any resource gathered from the Web, caveat
emptor.  I do know of people using the Ngram set within SMT language
models to advantage, so it can be a useful resource.
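
For anyone wanting to repeat the kind of side-check described in the quoted
message below (comparing how often "won't" and apostrophe-less "wont" occur in
a corpus), here is a minimal sketch. It assumes light, whitespace-only
tokenisation that keeps contractions as single tokens; the function name and
the sample sentences are hypothetical, and this is not the script used for the
GigaWord check.

```python
from collections import Counter

def contraction_counts(lines, forms=("won't", "wont")):
    """Count occurrences of each form using light, whitespace-only tokenisation."""
    wanted = {f.lower() for f in forms}
    counts = Counter()
    for line in lines:
        for token in line.split():
            # Strip surrounding punctuation but keep the internal apostrophe,
            # so "won't" stays a single token.
            token = token.strip('.,;:!?"()[]').lower()
            if token in wanted:
                counts[token] += 1
    return counts

# Hypothetical sample; in practice, iterate over the lines of a corpus file.
sample = [
    "I won't go, and she won't either.",
    "He was wont to complain.",
]
counts = contraction_counts(sample)
print(counts["won't"], counts["wont"])
```

Run over a real corpus, a large excess of "wont" over "won't" would suggest
that apostrophes were lost somewhere upstream rather than that the archaic
word is common.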

Miles

2008/8/31 Christopher Brewster <C.Brewster at dcs.shef.ac.uk>:
> There was an extensive analysis by Jean Veronis a while ago showing that
> Google counts were invalid (long before the release of the ngram corpus).
> Is that analysis still valid?
> Does that mean we should not trust Google's ngram corpus at all?
>
> Christopher
>
> *****************************************************
> Department of Computer Science, University of Sheffield
> Regent Court, 211 Portobello Street
> Sheffield   S1 4DP   UNITED KINGDOM
> Web: http://www.dcs.shef.ac.uk/~kiffer/
> Tel: +44(0)114-22.21967  Fax: +44 (0)114-22.21810
> Skype: christopherbrewster
> SkypeIn (UK): +44 (20) 8144 0088
> SkypeIn (US): +1 (617) 381-4281
> *****************************************************
> Corruptissima re publica plurimae leges. Tacitus. Annals 3.27
>
>
>
>
> On 31 Aug 2008, at 17:57, Miles Osborne wrote:
>
> Alex's question was about the ngram release, not about the counts you get
> from using Google's search engine.
> The ngram release has very light tokenisation and indeed, "won't" appears as
> a single token.
>
> I did a side-check, looking at the frequencies of "won't" and related
> contractions in the GigaWord corpus.  The frequencies within the GigaWord
> release are in line with expectations.
>
> So unless there is something odd about the sample of Web pages used to
> create the Ngram release, there is a problem with some of the counts.
>
> Miles
>
> On Sat, 30 Aug 2008, James L. Fidelholtz wrote:
>
>
>> Well, I don't have *the* solution to your problem necessarily, but I have
>> noticed that Google definitely tends to disregard punctuation (*not*
>> spaces), and so would probably treat A.M. and a.m. and a.m and am. and am
>> all as the same, for example (you can also notice this in responses to
>> certain queries). On the other hand, your query and your examples do not
>> jibe, insofar as you give "won't" with 37K responses, but "wont" with 3.7M
>> responses (which latter *must* include mostly actual "won't", since archaic
>> 'wont' would hardly occur that many times, even in Google's 9.8 gazillion
>> pages (I don't have much familiarity with the ngram corpus, but I assume it
>> is derived from the normal Google 'corpus')). More to the point, from your
>> very query, one wouldn't expect *any* responses in your corpus to "won't".
>>
>> Jim
> --
> The University of Edinburgh is a charitable body, registered in Scotland,
> with registration number SC005336.
> _______________________________________________
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
>
>





