[Corpora-List] Google ngram corpus: problem with won't

Alex Clark alexc at cs.rhul.ac.uk
Sat Aug 30 10:58:03 UTC 2008


I have noticed that there seems to be a problem with the 
treatment of "won't" in the Google ngram corpus. We would expect it to 
occur about 100 million times, but it seems to have disappeared or to be
tokenized in a non-standard way. We would expect it to appear in the Penn 
Treebank style as "wo" + "n't".

As a unigram we get

won't   37251
wont    3677346
wo      1226869

In the bigrams we get things like

I 'm    188587483

but nothing that I can find that corresponds to "won't".
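
For anyone who wants to repeat the check, here is a minimal sketch of the kind of lookup described above. It assumes the count files are plain text with one entry per line, the n-gram followed by whitespace and then its count; the candidate list and the function name are my own illustrative choices, not anything defined by the corpus, and the list is certainly not exhaustive.

```python
from typing import Dict, Iterable, Set

# Hypothetical candidate spellings and tokenizations of "won't".
# This list is an assumption for illustration, not an exhaustive inventory.
CANDIDATES: Set[str] = {
    "won't", "wont", "wo", "n't",    # unigram forms
    "wo n't", "won 't", "won ' t",   # possible multi-token splits
}

def candidate_counts(lines: Iterable[str],
                     candidates: Set[str] = CANDIDATES) -> Dict[str, int]:
    """Collect counts for candidate n-grams from 'ngram<whitespace>count' lines."""
    found: Dict[str, int] = {}
    for line in lines:
        # Split off the trailing count; the n-gram itself may contain spaces.
        parts = line.rstrip("\n").rsplit(None, 1)
        if len(parts) != 2 or not parts[1].isdigit():
            continue  # skip lines that do not match the assumed layout
        ngram, count = parts
        if ngram in candidates:
            found[ngram] = found.get(ngram, 0) + int(count)
    return found

# The figures quoted in this message, used as sample input.
sample = ["won't\t37251", "wont 3677346", "wo\t1226869", "I 'm\t188587483"]
counts = candidate_counts(sample)
total = sum(counts.values())
```

On the figures quoted here, the three unigram candidates together account for under 5 million occurrences, well short of the roughly 100 million expected. To run it over a real file, pass an open file handle (for example from `gzip.open(path, "rt", encoding="utf-8")`) as `lines`.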

Has anyone else noticed this?

regards

Alex

--
Alexander Clark     alexc at cs.rhul.ac.uk
http://www.cs.rhul.ac.uk/home/alexc/
Lecturer, Department of Computer Science,
Royal Holloway, University of London, Egham, Surrey TW20 0EX
Direct 01784 443430 Department 01784 434455 Fax 01784 439786

_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora