[Corpora-List] Google Books Ngram Corpus Problems

Wed Jul 20 16:17:38 UTC 2011

Dear All,

I have a problem with the Google Books Ngram Corpus. Some files do not
contain properly-formatted lines. For instance, consider the following
lines from googlebooks-eng-all-4gram-20090715-278.csv.zip, which
should contain 4grams and thus have 8 whitespace-separated fields: four
for the tokens of the 4gram and four for the year and frequency data.

zcat googlebooks-eng-all-4gram-20090715-278.csv.zip | awk '{if (NF==6) print $0}' | head
"! ' 	1958	1	1	1
"! ' 	1963	1	1	1
"! ' 	1965	3	3	3
"! ' 	1966	3	3	3
"! ' 	1972	2	2	2
"! ' 	1974	1	1	1
"! ' 	1975	2	2	2
"! ' 	1978	1	1	1
"! ' 	1980	1	1	1
"! ' 	1981	2	2	2

They contain only six fields because the ngram is actually a bigram.
There are many examples of this case. In fact, in this particular
file, about one third of the lines do not contain 4grams but trigrams,
bigrams or even unigrams. I couldn't find anyone else complaining
about this issue on the Web. Am I missing something? Did anyone else
have the same problem?
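
To put a number on the "about one third" estimate, the field-count check above can be extended to a histogram over all lines. The following is only a minimal sketch: the function name ngram_order_histogram and the parameter n_stat_fields are made up for illustration, and it assumes, as in the sample lines, that each line ends with four numeric fields (the year plus three counts). It splits on runs of whitespace, the same way awk computes NF by default.

```python
from collections import Counter


def ngram_order_histogram(lines, n_stat_fields=4):
    """Count lines by apparent n-gram order.

    The order is the number of whitespace-separated fields minus the
    trailing year/count fields (assumed to be four, as in the sample).
    """
    counts = Counter()
    for line in lines:
        fields = line.split()  # splits on whitespace runs, like awk's NF
        if not fields:
            continue  # skip empty lines
        counts[len(fields) - n_stat_fields] += 1
    return counts


# Two illustrative lines: one copied from the output above (6 fields),
# and one made-up well-formed 4gram line (8 fields).
sample = [
    "\"! ' \t1958\t1\t1\t1",
    "a b c d\t1990\t5\t5\t3",
]

hist = ngram_order_histogram(sample)
for order in sorted(hist):
    print(order, hist[order])
```

Feeding it the lines of the decompressed file (for example, an open text file object) would give the share of lines whose apparent order is below 4.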

 Thanks,

 Amaç Herdağdelen
