[Corpora-List] German n-gram files
Christian Pietsch
chr.pietsch at googlemail.com
Fri Nov 18 12:55:24 UTC 2011
Dear Naohiro,
as an alternative to the Google Web1T n-gram collection that Yannick
referred to, you might want to look at the Google Books n-gram
collection, which also includes a massive German dataset and can be
downloaded directly from http://books.google.com/ngrams/datasets .
Besides genre, there are other, more subtle differences between the
two data collections, e.g. with respect to tokenization and
punctuation handling. In addition to the n-gram and its frequency, each
record in the Books n-gram corpus gives the year of publication the
count refers to, so your colleague can filter the n-grams according to
his or her definition of “contemporary”.
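
To illustrate the filtering step, here is a minimal Python sketch. It
assumes the downloaded files are tab-separated with the n-gram, the year
and the match count as the first three fields (any further columns are
ignored), so please check this against the files you actually download.
The function name, the sample rows and the cut-off year 1990 are made up
for illustration only.

```python
from collections import Counter


def contemporary_counts(lines, min_year=1990):
    """Sum the match counts per n-gram over all years >= min_year.

    Assumes tab-separated records whose first three fields are
    n-gram, year, match count; extra columns are ignored.
    """
    totals = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip empty or malformed lines
        ngram, year, match_count = fields[0], fields[1], fields[2]
        try:
            if int(year) >= min_year:
                totals[ngram] += int(match_count)
        except ValueError:
            continue  # skip records with non-numeric year or count
    return totals


# Invented sample rows, not real corpus data.
sample = [
    "guten Tag\t1895\t12\t10\t8\n",
    "guten Tag\t1995\t40\t35\t20\n",
    "guten Tag\t2005\t55\t50\t30\n",
    "gute Nacht\t1920\t7\t7\t5\n",
]

for ngram, count in contemporary_counts(sample).items():
    print(ngram, count)
```

For a real file, one could pass an open file object instead of the
sample list, since the function only iterates over lines.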
Regards,
Christian
--
Christian Pietsch <http://purl.org/net/pietsch>