[Corpora-List] Frequency lists - summary

CRuehlemann at aol.com CRuehlemann at aol.com
Mon Mar 2 17:37:51 UTC 2009


 
Dear All
 
Here’s a summary of the responses to my query on word frequency lists other
than Kilgarriff’s at http://www.kilgarriff.co.uk/bnc-readme.html (derived
from the BNC) and the ones discussed in
 
Leech, G., P. Rayson and A. Wilson. (2001). Word Frequencies in Written and
Spoken English: Based on the British National Corpus. London: Longman
(derived from the BNC)
 
and in
 
McCarthy, M. J. (1998). Spoken Language and Applied Linguistics. Cambridge:
Cambridge University Press (derived from the Cambridge International
Corpus).
 
Specifically, I was asking for (i) more word frequency lists available
either in print or online and (ii) references to research discussing the
definite article "the", which tops most frequency lists derived from general
corpora, in terms of reference (anaphoric, cataphoric, etc.).
 
There was only one response to (ii), from Steve Coffey, who has done research
on the indefinite articles "a"/"an". In the meantime, I found a very thorough
analysis of the use of "the" and the definite noun phrase (NP) that it goes
with in Biber et al. (1999: 263 ff.), where the authors not only outline the
different reference patterns of definite NPs (viz. anaphoric, indirect
anaphoric, cataphoric, situational, generic, and idiomatic) but also
calculate the proportions of these reference patterns in four registers
(viz. Conversation, Fiction, News, and Academic Writing).
 
There were a number of useful responses to (i):
 
Paul Rayson pointed to the companion website for the Leech et al. book at:
http://ucrel.lancs.ac.uk/bncfreq/
and to the references therein to other, earlier frequency lists at:
http://ucrel.lancs.ac.uk/bncfreq/samples/foreword.pdf

John D. Burger and Stefan Evert mentioned the Google language modeling data,
based on over a trillion words' worth of web pages, at:
http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html


Angus B. Grieve-Smith contributed information on the frequency list from
the Brown Corpus of written American English (ca. 1962), which is available
from the Oxford Text Archive at http://ota.ahds.ac.uk/headers/0668.xml
and also available in print:

Francis, W. N., H. Kucera and A. W. Mackie. (1982). Frequency Analysis of
English Usage: Lexicon and Grammar. Houghton Mifflin.

Mark Davies pointed to frequency lists for American English (based on COCA, a
balanced corpus of nearly 400 million words), TIME Magazine (100m words,
1920s-2000s), Spanish (20m words, 1900s) and Portuguese (20m words, 1900s).
Also available are n-grams for all of these corpora (as well as for the BNC)
at: http://corpus.byu.edu/word_frequency.asp

Adriano Ferraresi mentioned several frequency lists (for English, but also
Italian and German) at: http://wacky.sslmit.unibo.it/, with the English lists
extracted from ukWaC, a very large web-derived corpus containing around
2 billion words. See also:


Baroni, Bernardini, Ferraresi, Zanchetta (in press). "The WaCky wide web: a
collection of very large linguistically processed web-crawled corpora".
Language Resources and Evaluation.

Finally, I should like to mention the Bank of English-derived word frequency
list in:
 
Sinclair, J. McH. (1999). 'A way with common words.' In: H. Hasselgard and
S. Oksefjell (eds.) Out of Corpora: Studies in honour of Stig Johansson.
Amsterdam: Rodopi, pp. 157-179.
 

Many thanks for all contributions.

Chris
-------------------------------------------------
Dr. Christoph Rühlemann
Ludwig-Maximilians-University, Munich