Corpora: Frequency Meaning

ramesh at clg.bham.ac.uk
Thu Feb 17 22:49:24 UTC 2000


Dear Dr Gomez
Cobuild used corpus lemma frequencies in their Dictionary (2nd edition,
1995). We devised a 5-band distinction, with 700 lemmas in the most frequent band, 1200 in the 2nd band, 1500 in the 3rd band, 3200 in the 4th and 8100 in the 5th. I can't remember the exact frequency cut-offs used, but I'm confident
that most users of the dictionary have found it a very useful addition.
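
As a rough illustration only (not Cobuild's actual procedure), the band sizes quoted above can be turned into a rank-based assignment. The sketch below assumes lemmas are simply ranked by corpus frequency and banded by rank, since the real frequency cut-offs are not given here; the names BAND_SIZES, band_for_rank and assign_bands are hypothetical.

```python
from bisect import bisect_left
from itertools import accumulate

# Band sizes as quoted in the message (most frequent band first).
BAND_SIZES = [700, 1200, 1500, 3200, 8100]
# Cumulative rank limits per band: 700, 1900, 3400, 6600, 14700.
BAND_LIMITS = list(accumulate(BAND_SIZES))


def band_for_rank(rank):
    """Return the band (1-5) for a 1-based frequency rank, or None if unbanded.

    Assumption: banding by rank stands in for the unknown frequency cut-offs.
    """
    if rank < 1:
        raise ValueError("rank must be >= 1")
    idx = bisect_left(BAND_LIMITS, rank)
    return idx + 1 if idx < len(BAND_LIMITS) else None


def assign_bands(lemma_freqs):
    """Map each lemma in a {lemma: frequency} dict to a band (or None)."""
    # Sort by descending frequency; ties are broken alphabetically.
    ranked = sorted(lemma_freqs, key=lambda lemma: (-lemma_freqs[lemma], lemma))
    return {lemma: band_for_rank(i) for i, lemma in enumerate(ranked, start=1)}
```

With these sizes the five bands together cover the 14,700 most frequent lemmas; anything ranked lower comes back as None (unbanded).
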
The exact cut-off points might be affected by the size of the corpus, and
may also be language-dependent (in a highly inflected language like Spanish, the relationships between some types and lemmas might differ from those in
a relatively uninflected language like English). Also the purpose of
your classification may affect your decisions. For a dictionary, lemma is
presumably more important than type, although type distribution within a
lemma may influence whether a form is treated under the main lemma form,
or is given separate headword status (e.g. "situated" in an English dictionary
may be a separate headword, as well as being an inflected form under the headword "situate"; similarly "painting" and "paint"). Word-class shifts would also
have to be taken into account.
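
One way to look at type distribution within a lemma is sketched below, under stated assumptions: a table of type (word-form) counts and a form-to-lemma mapping are already available, and a form becomes a candidate for separate headword status when it accounts for at least some share of its lemma's total frequency. The 0.5 threshold and the function names are hypothetical, chosen only for illustration; a real decision would also need to consider word-class shifts.

```python
from collections import defaultdict


def form_shares(type_freqs, lemma_of):
    """Return {lemma: {form: share of the lemma's total frequency}}."""
    totals = defaultdict(int)
    for form, freq in type_freqs.items():
        totals[lemma_of[form]] += freq
    shares = defaultdict(dict)
    for form, freq in type_freqs.items():
        lemma = lemma_of[form]
        if totals[lemma] > 0:  # skip lemmas with no occurrences at all
            shares[lemma][form] = freq / totals[lemma]
    return dict(shares)


def headword_candidates(type_freqs, lemma_of, min_share=0.5):
    """List non-base forms whose share of their lemma is at least min_share.

    min_share is a hypothetical threshold, not one taken from any dictionary.
    """
    found = []
    for lemma, forms in form_shares(type_freqs, lemma_of).items():
        for form, share in forms.items():
            if form != lemma and share >= min_share:
                found.append(form)
    return sorted(found)


# Toy counts invented purely for illustration, not real corpus figures.
toy_freqs = {"situate": 5, "situated": 90, "situates": 5}
toy_lemmas = {"situate": "situate", "situated": "situate", "situates": "situate"}
candidates = headword_candidates(toy_freqs, toy_lemmas)
```

With the invented counts above, "situated" makes up 90% of the lemma and is the only form flagged.
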
Hope this helps.
Ramesh

Ramesh Krishnamurthy
Honorary Research Fellow
Corpus Research Group
University of Birmingham


