[Corpora-List] Google ngram corpus: problem with won't

Alex Clark alexc at cs.rhul.ac.uk
Sat Aug 30 15:28:15 UTC 2008


In the ngram corpus, A.M. a.m. etc are all treated differently.
e.g.
a.m. 369995
A.M. 28311
am	296583356
AM	261747431
Am	17935849
aM	169567
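
As a rough illustration of what a punctuation- and case-insensitive treatment would do to these entries, here is a minimal Python sketch. The `normalise` function is purely hypothetical (lower-casing and dropping full stops); it is not a description of how Google actually processed the data, and the counts are simply the ones listed above.

```python
from collections import defaultdict

# Unigram counts as quoted in this message.
counts = {
    "a.m.": 369995,
    "A.M.": 28311,
    "am": 296583356,
    "AM": 261747431,
    "Am": 17935849,
    "aM": 169567,
}


def normalise(token):
    """Hypothetical normalisation: lower-case and drop full stops."""
    return token.lower().replace(".", "")


# Sum the counts of all surface forms that share a normalised form.
merged = defaultdict(int)
for token, count in counts.items():
    merged[normalise(token)] += count

for form, total in merged.items():
    print(form, total)
```

Under that assumed normalisation all six entries collapse into the single form "am", which is exactly what the corpus does not do.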

You are quite correct that archaic "wont" as in "as is my wont" accounts 
for only a fraction of the counts of "wont", perhaps 60K, the rest being 
misspellings of "won't".

And yes, if it is tokenised with the normal Penn method, I would expect 
there to be no occurrences of "won't"; though I suppose they could arise 
through being broken off from some larger token.
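
To make the expectation concrete, here is a minimal Python sketch of the Penn-style splitting of contractions. It is a simplified approximation covering only "n't" and a few common clitics, not the actual Penn Treebank tokeniser and not whatever Google used; the name `penn_split` is my own.

```python
import re

# Simplified approximation of two Penn Treebank conventions (assumption:
# the real tokeniser has many more rules than these two).
NT = re.compile(r"(\w+)(n't)\b", re.IGNORECASE)
CLITIC = re.compile(r"(\w+)('(?:ll|re|ve|s|m|d))\b", re.IGNORECASE)


def penn_split(text):
    """Split contractions Penn-style, then split on whitespace."""
    text = NT.sub(r"\1 \2", text)      # won't -> wo n't, can't -> ca n't
    text = CLITIC.sub(r"\1 \2", text)  # I'm -> I 'm, we'll -> we 'll
    return text.split()


print(penn_split("I won't go"))
print(penn_split("I'm sure"))
print(penn_split("as is my wont"))
```

With rules like these "won't" never survives as a single token, whereas "wont" without the apostrophe passes through untouched.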

Alex



On Sat, 30 Aug 2008, James L. Fidelholtz wrote:

> Well, I don't have *the* solution to your problem necessarily, but I have noticed that Google definitely tends to disregard punctuation (*not* spaces), and so would probably
> treat A.M. and a.m. and a.m and am. and am all as the same, for example (you can also notice this in responses to certain queries). On the other hand, your query and your
> examples do not jibe, insofar as you give "won't" with 37K responses, but "wont" with 3.7M responses (which latter *must* include mostly actual "won't", since archaic 'wont'
> would hardly occur that many times, even in Google's 9.8 gazillion pages (I don't have much familiarity with the ngram corpus, but I assume it is derived from the normal
> Google 'corpus')). More to the point, from your very query, one wouldn't expect *any* responses in your corpus to "won't".
>  
> Jim
> 
> On Sat, Aug 30, 2008 at 5:58 AM, Alex Clark <alexc at cs.rhul.ac.uk> wrote:
>
>       I have noticed that there seems to be a problem with the
>       treatment of "won't" in the Google ngram corpus. We would expect it to
>       occur about 100 million times, but it seems to have disappeared or be
>       tokenized in a non-standard way. We would expect it to appear in the Penn
>       style as "wo" + "n't"
>
>       As a unigram we get
>
>       won't   37251
>       wont 3677346
>       wo      1226869
>
>       in the bigrams we get things like
>
>       I 'm    188587483
>
>       but nothing that I can find that corresponds to "won't".
>
>       Has anyone else noticed this?
>
>       regards
>
>       Alex
>
>       --
>       Alexander Clark     alexc at cs.rhul.ac.uk
>       http://www.cs.rhul.ac.uk/home/alexc/
>       Lecturer, Department of Computer Science,
>       Royal Holloway, University of London, Egham, Surrey TW20 0EX
>       Direct 01784 443430 Department 01784 434455 Fax 01784 439786
>
>       _______________________________________________
>       Corpora mailing list
>       Corpora at uib.no
>       http://mailman.uib.no/listinfo/corpora
> 
> 
> 
> 
> --
> James L. Fidelholtz
> Posgrado en Ciencias del Lenguaje
> Instituto de Ciencias Sociales y
> Humanidades
> Benemérita Universidad Autónoma de
> Puebla, MÉXICO
> 
>

--
Alexander Clark     alexc at cs.rhul.ac.uk
http://www.cs.rhul.ac.uk/home/alexc/
Lecturer, Department of Computer Science,
Royal Holloway, University of London, Egham, Surrey TW20 0EX
Direct 01784 443430 Department 01784 434455 Fax 01784 439786