[Corpora-List] On the use of Google ngrams

Wed Feb 9 11:24:45 UTC 2011

Hi,

related to the question of what use validated corpora have now that we  
can all download and analyse millions of words from the internet, you  
may be interested in the article we just published about the use of  
the Google ngrams for psycholinguistic research on word recognition:

http://www.frontiersin.org/language_sciences/abstract/9569

In a nutshell, we show that the Google word frequencies (Ngram=1) do
not correlate well with the lexical decision times from the English
Lexicon Project (Elexicon) and other databases. Furthermore, the
correlations are lower for frequencies based on older books. At first
sight, the latter is good news, since it is what one would expect if
word use changes over time. However, we also see that the 2005+
frequencies are better predictors even for experiments run in 1990,
suggesting that part of the quality difference is due to the types of
books included in the Google project over the years. So it may be
good to keep in mind that apparent differences in word use over time
are to some extent influenced by the fact that the mix of books
included in Google Books may not be constant across years.
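
For readers who want to try this kind of comparison on their own data,
here is a minimal sketch in Python of correlating log word frequency
with mean lexical decision times. It is not the analysis from the
article: the function names and the toy numbers are made up for
illustration, and the log10(count + 1) transform is only an assumption
about how one might scale the counts.

```python
import math


def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def frequency_rt_correlation(counts, reaction_times):
    """Correlate log10(count + 1) with mean reaction time (ms).

    counts: dict mapping word -> raw frequency count (hypothetical input)
    reaction_times: dict mapping word -> mean lexical decision time in ms
    Only words present in both dicts are used.
    """
    words = sorted(w for w in counts if w in reaction_times)
    log_freqs = [math.log10(counts[w] + 1) for w in words]
    times = [reaction_times[w] for w in words]
    return pearson(log_freqs, times)


# Toy, made-up numbers purely for illustration (not real data).
toy_reaction_times = {
    "the": 520, "house": 580, "garden": 610, "lexicon": 700, "quixotic": 820,
}
toy_counts = {
    "the": 5_000_000, "house": 200_000, "garden": 80_000,
    "lexicon": 3_000, "quixotic": 400,
}

r = frequency_rt_correlation(toy_counts, toy_reaction_times)
print(f"correlation between log frequency and RT: {r:.3f}")
```

Running the same function on counts taken from different publication
periods, against the same set of reaction times, gives the kind of
period-by-period comparison described above. A more strongly negative
correlation means the frequencies predict the decision times better.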

Kind regards,

Marc Brysbaert
