[Corpora-List] Google Books Ngram Corpus Problems
Amac Herdagdelen
amac at herdagdelen.com
Wed Jul 20 16:17:38 UTC 2011
Dear All,
I have a problem with the Google Books Ngram Corpus: some files contain
lines that are not properly formatted. For instance, consider the following
lines from googlebooks-eng-all-4gram-20090715-278.csv.zip, which
should contain 4grams and thus have 8 whitespace-separated fields: four
for the tokens of the 4gram and four for the year and frequency data.
zcat googlebooks-eng-all-4gram-20090715-278.csv.zip | awk '{if (NF==6) print $0}' | head
"! ' 1958 1 1 1
"! ' 1963 1 1 1
"! ' 1965 3 3 3
"! ' 1966 3 3 3
"! ' 1972 2 2 2
"! ' 1974 1 1 1
"! ' 1975 2 2 2
"! ' 1978 1 1 1
"! ' 1980 1 1 1
"! ' 1981 2 2 2
They contain only six fields because the ngram is actually a bigram.
There are many such cases. In fact, in this particular file, about one
third of the lines contain not 4grams but trigrams, bigrams or even
unigrams. I couldn't find anyone else complaining about this issue on
the Web. Am I missing something? Has anyone else had the same problem?
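To check the "about one third" figure, one could tally the lines by
field count instead of filtering for a single value. This is only a
minimal sketch, assuming the same whitespace splitting as the command
above; field_histogram is a hypothetical helper name, not part of any
existing tool.

```shell
# Tally how many input lines have each whitespace-separated field count.
# Reads from stdin and prints "<field count> <number of lines>",
# sorted numerically by field count.
# NOTE: field_histogram is an illustrative name chosen for this sketch.
field_histogram() {
    awk '{ count[NF]++ } END { for (n in count) print n, count[n] }' | sort -n
}

# Usage (assumes the file is in the current directory):
#   zcat googlebooks-eng-all-4gram-20090715-278.csv.zip | field_histogram
```

With well-formed 4gram data every line should fall under a field count
of 8; lines counted under 5, 6 or 7 would correspond to the unigram,
bigram and trigram cases described above.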
Thanks,
Amaç Herdağdelen