[Corpora-List] Google Books Ngram Corpus Problems

Amac Herdagdelen amac at herdagdelen.com
Wed Jul 20 18:14:57 UTC 2011


Thanks Mark. I also noticed that the problematic lines contain
punctuation. In particular, all the ones I have seen start with a
double quote. I am willing to discard such lines, but if a bigram
line found in a 4gram file actually needs to be aggregated with the
original bigram data, that would be too much trouble.
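
A minimal sketch of the "discard such lines" option, following the field-count logic from the original message (n-gram tokens plus four columns for year and frequency data, all whitespace-separated). The function names and the single-member, UTF-8 zip layout are assumptions made for illustration, not something confirmed in this thread. The sketch only drops short lines; it does not attempt to aggregate them with the bigram data.

```python
import io
import zipfile
from collections import Counter
from typing import Iterable, Iterator

# Year plus three frequency columns, per the description in the thread.
COUNT_FIELDS = 4


def ngram_order(line: str) -> int:
    """Tokens in the n-gram: whitespace-separated fields minus the count columns."""
    return len(line.split()) - COUNT_FIELDS


def order_histogram(lines: Iterable[str]) -> Counter:
    """Tally how many non-blank lines carry each n-gram order."""
    return Counter(ngram_order(line) for line in lines if line.strip())


def keep_order(lines: Iterable[str], n: int = 4) -> Iterator[str]:
    """Yield only the lines whose n-gram has exactly n tokens."""
    for line in lines:
        if line.strip() and ngram_order(line) == n:
            yield line


def iter_zip_lines(path: str) -> Iterator[str]:
    """Stream text lines from the first member of a zip archive.

    Assumption: one CSV member per archive, UTF-8 encoded.
    """
    with zipfile.ZipFile(path) as archive:
        with archive.open(archive.namelist()[0]) as raw:
            yield from io.TextIOWrapper(raw, encoding="utf-8")
```

Under those assumptions, order_histogram(iter_zip_lines(path)) would show what share of a 4gram file is actually made up of shorter n-grams, and keep_order(iter_zip_lines(path), 4) would stream only the well-formed lines.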

Amaç

On Wed, Jul 20, 2011 at 1:06 PM, Mark Davies <Mark_Davies at byu.edu> wrote:
> I noticed the same thing as I processed the n-grams for http://googlebooks.byu.edu/. I asked four of the higher-ups at Google Books about this, but I didn't receive an answer. The only thing I can imagine is that these are for n-grams that have punctuation, such as:
>
> " , said Billy           ==>
> ---  ---  said Billy
>
> But that's just a guess.
>
> Mark D.
>
> ============================================
> Mark Davies
> Professor of Linguistics / Brigham Young University
> http://davies-linguistics.byu.edu
>
> ** Corpus design and use // Linguistic databases **
> ** Historical linguistics // Language variation **
> ** English, Spanish, and Portuguese **
> ============================================
> ________________________________________
> From: corpora-bounces at uib.no [corpora-bounces at uib.no] On Behalf Of Amac Herdagdelen [amac at herdagdelen.com]
> Sent: Wednesday, July 20, 2011 10:17 AM
> To: corpora at uib.no
> Subject: [Corpora-List] Google Books Ngram Corpus Problems
>
> Dear All,
>
> I have a problem with the Google Books Ngram Corpus. Some files do not
> contain properly formatted lines. For instance, consider the following
> lines from googlebooks-eng-all-4gram-20090715-278.csv.zip, which
> should contain 4grams and thus have 8 whitespace-separated fields: four
> for the tokens of the 4gram and four for the year and frequency data.
>
> zcat googlebooks-eng-all-4gram-20090715-278.csv.zip | awk '{if (NF==6) print $0}' | head
> "! '    1958    1       1       1
> "! '    1963    1       1       1
> "! '    1965    3       3       3
> "! '    1966    3       3       3
> "! '    1972    2       2       2
> "! '    1974    1       1       1
> "! '    1975    2       2       2
> "! '    1978    1       1       1
> "! '    1980    1       1       1
> "! '    1981    2       2       2
>
> They contain only six fields because the ngram is actually a bigram.
> There are many examples of this case. In fact, in this particular
> file, about one third of the lines do not contain 4grams but trigrams,
> bigrams or even unigrams. I couldn't find anyone else complaining
> about this issue on the Web. Am I missing something? Did anyone else
> have the same problem?
>
>  Thanks,
>
>  Amaç Herdağdelen
>
> _______________________________________________
> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
