[Corpora-List] Google Books Ngram Corpus Problems

Mark Davies Mark_Davies at byu.edu
Wed Jul 20 17:06:49 UTC 2011


I noticed the same thing as I processed the n-grams for http://googlebooks.byu.edu/. I asked four of the higher-ups at Google Books about this, but I didn't receive an answer. The only thing I can imagine is that these are for n-grams that have punctuation, such as:

" , said Billy           ==>
---  ---  said Billy

But that's just a guess.

Mark D.

============================================
Mark Davies
Professor of Linguistics / Brigham Young University
http://davies-linguistics.byu.edu

** Corpus design and use // Linguistic databases **
** Historical linguistics // Language variation **
** English, Spanish, and Portuguese **
============================================
________________________________________
From: corpora-bounces at uib.no [corpora-bounces at uib.no] On Behalf Of Amac Herdagdelen [amac at herdagdelen.com]
Sent: Wednesday, July 20, 2011 10:17 AM
To: corpora at uib.no
Subject: [Corpora-List] Google Books Ngram Corpus Problems

Dear All,

I have a problem with the Google Books Ngram Corpus. Some files do not
contain properly-formatted lines. For instance, consider the following
lines from googlebooks-eng-all-4gram-20090715-278.csv.zip -- which
should contain 4grams and thus have 8 whitespace-separated fields: four
for the tokens of the 4gram and four for the year and frequency data.

zcat googlebooks-eng-all-4gram-20090715-278.csv.zip | awk '{if (NF==6) print $0}' | head
"! '    1958    1       1       1
"! '    1963    1       1       1
"! '    1965    3       3       3
"! '    1966    3       3       3
"! '    1972    2       2       2
"! '    1974    1       1       1
"! '    1975    2       2       2
"! '    1978    1       1       1
"! '    1980    1       1       1
"! '    1981    2       2       2

They contain only six fields because the ngram is actually a bigram.
There are many examples of this case. In fact, in this particular
file, about one third of the lines do not contain 4grams but trigrams,
bigrams or even unigrams. I couldn't find anyone else complaining
about this issue on the Web. Am I missing something? Did anyone else
have the same problem?
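
A rough way to quantify the "about one third" figure is to tally lines by how many tokens precede the count columns. The sketch below is only an illustration in Python, not the command used above: the function name `ngram_order_histogram` and the sample lines are hypothetical, and it assumes, as in the awk check, that the last four whitespace-separated fields of every line are the year and frequency data.

```python
from collections import Counter


def ngram_order_histogram(lines, n_count_columns=4):
    """Tally lines by the number of tokens in front of the count columns.

    Assumption: the last `n_count_columns` whitespace-separated fields
    are the year and frequency data, so everything before them is the n-gram.
    """
    histogram = Counter()
    for line in lines:
        fields = line.split()  # whitespace split, like awk's default NF
        if not fields:
            continue  # skip blank lines
        histogram[len(fields) - n_count_columns] += 1
    return histogram


# Hypothetical sample: one short line like those shown above, one proper 4gram.
sample = [
    "\"! '\t1958\t1\t1\t1",
    "a b c d\t1990\t5\t4\t3",
]

for order, count in sorted(ngram_order_histogram(sample).items()):
    print(order, count)
```

Run over the decompressed lines of a 4gram file, any key other than 4 in the resulting histogram would flag the kind of short lines described here.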

 Thanks,

 Amaç Herdağdelen

_______________________________________________
UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora