[Corpora-List] Google Books Ngram Corpus Problems
Mark Davies
Mark_Davies at byu.edu
Wed Jul 20 17:06:49 UTC 2011
I noticed the same thing as I processed the n-grams for http://googlebooks.byu.edu/. I asked four of the higher-ups at Google Books about this, but I didn't receive an answer. The only thing I can imagine is that these are for n-grams that have punctuation, such as:
" , said Billy ==>
--- --- said Billy
But that's just a guess.
Mark D.
============================================
Mark Davies
Professor of Linguistics / Brigham Young University
http://davies-linguistics.byu.edu
** Corpus design and use // Linguistic databases **
** Historical linguistics // Language variation **
** English, Spanish, and Portuguese **
============================================
________________________________________
From: corpora-bounces at uib.no [corpora-bounces at uib.no] On Behalf Of Amac Herdagdelen [amac at herdagdelen.com]
Sent: Wednesday, July 20, 2011 10:17 AM
To: corpora at uib.no
Subject: [Corpora-List] Google Books Ngram Corpus Problems
Dear All,
I have a problem with the Google Books Ngram Corpus. Some files do not
contain properly-formatted lines. For instance, consider the following
lines from googlebooks-eng-all-4gram-20090715-278.csv.zip -- which
should contain 4grams, and thus have 8 whitespace-separated fields: four
for the tokens of the 4gram and four for the year and frequency data.
zcat googlebooks-eng-all-4gram-20090715-278.csv.zip | awk '{if (NF==6) print $0}' | head
"! ' 1958 1 1 1
"! ' 1963 1 1 1
"! ' 1965 3 3 3
"! ' 1966 3 3 3
"! ' 1972 2 2 2
"! ' 1974 1 1 1
"! ' 1975 2 2 2
"! ' 1978 1 1 1
"! ' 1980 1 1 1
"! ' 1981 2 2 2
They contain only six fields because the ngram is actually a bigram.
There are many examples of this case. In fact, in this particular
file, about one third of the lines do not contain 4grams but trigrams,
bigrams or even unigrams. I couldn't find anyone else complaining
about this issue on the Web. Am I missing something? Did anyone else
have the same problem?
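
To put a number on the "about one third" estimate, the same awk approach can tally how many lines have each field count. This is only a minimal sketch: the function name field_histogram is made up for illustration, and it assumes, as above, that every token and count is separated by whitespace.

```shell
# Hypothetical helper: read lines on stdin and print "<field count> <number of lines>",
# one pair per line, sorted by field count.
field_histogram() {
  # n[NF]++ counts the lines seen for each distinct number of fields.
  awk '{ n[NF]++ } END { for (k in n) print k, n[k] }' | sort -n
}
```

It would be run the same way as the command above, for example: zcat googlebooks-eng-all-4gram-20090715-278.csv.zip | field_histogram. Under the assumption stated earlier, any row whose first column is not 8 would be a line that does not hold a full 4gram.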
Thanks,
Amaç Herdağdelen
_______________________________________________
UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora