[Corpora-List] Frequency lists - summary
CRuehlemann at aol.com
Mon Mar 2 17:37:51 UTC 2009
Dear All
Here’s a summary of the responses to my query on word frequency lists other
than Kilgarriff’s at http://www.kilgarriff.co.uk/bnc-readme.html (derived from
the BNC) and the ones discussed in
Leech, G., P. Rayson and A. Wilson. (2001). Word Frequencies in Written and
Spoken English: Based on the British National Corpus. London: Longman
(derived from the BNC)
and in
McCarthy, M. J. (1998). Spoken Language and Applied Linguistics. Cambridge:
Cambridge University Press (derived from the Cambridge International
Corpus).
Specifically, I was asking for (i) more word frequency lists available
either in print or online and (ii) references to research discussing "the",
which tops most frequency lists derived from general corpora, in terms of
reference (anaphoric, cataphoric, etc.).
There was only one response to (ii), from Steve Coffey, who has done research
on the indefinite articles a/an. In addition, I found a detailed analysis of
the use of "the" and the definite noun phrase (NP) it occurs in, in
Biber et al. (1999: 263 ff.), where the authors not only outline the
different reference patterns of definite NPs (viz. anaphoric, indirect anaphoric,
cataphoric, situational, generic, and idiomatic) but also calculate the
proportions of these reference patterns in four registers (viz. Conversation,
Fiction, News, and Academic Writing).
There were a number of useful responses to (i):
Paul Rayson pointed to the companion website for the Leech et al. book at:
http://ucrel.lancs.ac.uk/bncfreq/
and references therein to other earlier frequency lists at:
http://ucrel.lancs.ac.uk/bncfreq/samples/foreword.pdf
John D. Burger and Stefan Evert mentioned the Google language modeling data,
based on over a trillion words' worth of web pages, at
http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html
Angus B. Grieve-Smith contributed information on the frequency list from
the Brown Corpus of written American English (ca. 1962), which is available
from the Oxford Text Archive at http://ota.ahds.ac.uk/headers/0668.xml
and also available in print:
Frequency Analysis of English Usage: Lexicon and Grammar
By Winthrop Nelson Francis, Henry Kucera, Andrew W. Mackie
Published by Houghton Mifflin, 1982
Mark Davies pointed to frequency lists for American English (based on COCA, a
balanced corpus of nearly 400 million words), TIME Magazine (100m words,
1920s-2000s), Spanish (20m words, 1900s) and Portuguese (20m words, 1900s). Also
available are n-grams for all of these languages (as well as for the BNC) at:
http://corpus.byu.edu/word_frequency.asp
Adriano Ferraresi mentioned several frequency lists (for English, but also
Italian and German) at: http://wacky.sslmit.unibo.it/, with the English lists
extracted from ukWaC, a very large web-derived corpus containing around
2 billion words. See also:
Baroni, Bernardini, Ferraresi, Zanchetta (in print). "The wacky wide web: a
collection of very large linguistically processed web-crawled corpora".
Language Resources and Evaluation.
Finally, I should like to mention the Bank of English-derived word frequency
list in:
Sinclair, J. McH. (1999). 'A way with common words.' In: H. Hasselgard and
S. Oksefjell (eds.) Out of Corpora: Studies in honour of Stig Johansson.
Amsterdam: Rodopi, pp. 157-179.
Many thanks for all contributions
Chris
-------------------------------------------------
Dr. Christoph Rühlemann
Ludwig-Maximilians-University, Munich