[Corpora-List] Google Ngram count

Miles Osborne miles at inf.ed.ac.uk
Sun Aug 31 16:57:57 UTC 2008


Alex's question was about the ngram release, not about the counts you get
from using Google's search engine.
The ngram release has very light tokenisation and indeed, "won't" appears as
a single token.

I did a side-check, looking at the frequencies of "won't" and related
contractions in the GigaWord corpus.  The frequencies within the GigaWord
release are in line with expectations.
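
A side-check of this kind can be sketched in a few lines of Python. This is only an illustration: the regular expression is an assumed stand-in for "very light tokenisation" (it keeps word-internal apostrophes, so "won't" stays a single token), the sample sentences are invented, and neither the GigaWord data nor the actual tokeniser used for the ngram release is reproduced here.

```python
import re
from collections import Counter

# Assumption: light tokenisation keeps word-internal apostrophes,
# so "won't" is one token and is distinct from "wont".
TOKEN_RE = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)*")

def count_tokens(lines):
    """Count lower-cased tokens over an iterable of text lines."""
    counts = Counter()
    for line in lines:
        counts.update(tok.lower() for tok in TOKEN_RE.findall(line))
    return counts

# Invented sample text; a real check would iterate over corpus files.
sample = [
    "I won't go, and she won't either.",
    "He was wont to say that they can't and don't agree.",
]

counts = count_tokens(sample)
for form in ("won't", "wont", "can't", "cant", "don't", "dont"):
    print(form, counts[form])
```

Comparing the count of each contraction against its apostrophe-less form is what would show whether a corpus, or a count table derived from it, has a suspicious ratio.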

So unless there is something odd about the sample of Web pages used to
create the Ngram release, there is a problem with some of the counts.

Miles

On Sat, 30 Aug 2008, James L. Fidelholtz wrote:

> Well, I don't have *the* solution to your problem necessarily, but I have
> noticed that Google definitely tends to disregard punctuation (*not*
> spaces), and so would probably treat A.M. and a.m. and a.m and am. and am
> all as the same, for example (you can also notice this in responses to
> certain queries). On the other hand, your query and your examples do not
> jibe, insofar as you give "won't" with 37K responses, but "wont" with 3.7M
> responses (which latter *must* include mostly actual "won't", since archaic
> 'wont' would hardly occur that many times, even in Google's 9.8 gazillion
> pages (I don't have much familiarity with the ngram corpus, but I assume it
> is derived from the normal Google 'corpus')). More to the point, from your
> very query, one wouldn't expect *any* responses in your corpus to "won't".
>
> Jim
-- 
The University of Edinburgh is a charitable body, registered in Scotland,
with registration number SC005336.