[Corpora-List] German n-gram files

Christian Pietsch chr.pietsch at googlemail.com
Fri Nov 18 12:55:24 UTC 2011


Dear Naohiro,

As an alternative to the Google Web1T n-gram collection Yannick
referred to, you might want to look at the Google Books n-gram
collection, which also includes a massive German dataset and can be
downloaded directly from http://books.google.com/ngrams/datasets .

Besides genre, there are other, more subtle differences between the
two data collections, e.g. with respect to tokenization and
punctuation handling. In addition to the n-gram and its frequency, the
Books n-gram corpus includes the year of publication for each n-gram,
so your colleague can filter the n-grams according to his or her
definition of “contemporary”.
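
To illustrate the filtering step, here is a minimal Python sketch. It
assumes each line of a downloaded file is tab-separated, with the n-gram,
the year and the match count in the first three columns; please check
this against the format description on the download page before relying
on it. The function name, the cut-off year and the sample lines are made
up for illustration.

```python
from collections import Counter


def contemporary_counts(lines, min_year):
    """Sum the match counts per n-gram over all years >= min_year.

    Assumes tab-separated lines whose first three fields are
    n-gram, year, match count (an assumption, not a confirmed spec).
    """
    totals = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip empty or malformed lines
        ngram, year, match_count = fields[0], fields[1], fields[2]
        if int(year) >= min_year:
            totals[ngram] += int(match_count)
    return totals


# Invented sample lines in the assumed layout, not real corpus data.
sample = [
    "Haus\t1899\t12\t10\t8\n",
    "Haus\t1995\t40\t35\t30\n",
    "Haus\t2005\t55\t50\t41\n",
    "Gebäude\t1990\t7\t7\t6\n",
]

result = contemporary_counts(sample, 1990)
for ngram, count in sorted(result.items()):
    print(ngram, count)
```

For a real file, you would pass an open file handle (opened with
encoding="utf-8") instead of the sample list, so the data is processed
line by line without loading it all into memory.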

Regards,
Christian

-- 
  Christian Pietsch <http://purl.org/net/pietsch>
