Corpora: Summary: Frequency Bands

Pascual Cantos Gomez pcantos at fcu.um.es
Fri Feb 25 16:14:02 UTC 2000


Dear Corpus Linguists,

I enclose a summary of the replies to my query on criteria for
establishing frequency bands.

Many thanks to:

Tony Berber Sardinha
James L. Fidelholtz
Eric Atwell
Ramesh Krishnamurthy

-----------------------------

James L. Fidelholtz:

Dear Pascual:
	I don't have any very recent info for you, but I did publish an
article in 1975 on English vowel reduction, which contains some
suggestive data for part of your question (at least for English,
although I would have to be convinced that frequency phenomena are
significantly different in this regard for different languages).  Now,
there is a pretty clear dividing line at about 4/M (plus or minus about
3/M) between words with reduced vowels in certain environments, and
vowels unreduced in those environments (of course, the more frequent
words show a greater tendency toward reduction).  It seems to me that
this would probably correspond to the difference between 'medium' and
'low', but a lot depends on how you define these categories.  Here, the
evidence is overwhelmingly strong, in my opinion.  There is some fairly 
weak evidence (from other environments with relatively few examples) for
another dividing line somewhere around 35-50/M, which might correspond
to the 'moderate'/'high' division, although my feelings are less strong
on various aspects of this decision.
	No doubt others will have different ideas on what these
differences correspond to, based on totally different analyzed data, but
maybe we can get at some consensus about what these categories (or a
smaller number of categories, perhaps) might correspond to
psychologically.  This last word is important, as there seem to exist
various factors which may make a relatively infrequent word
psychologically more salient, or vice versa (e.g., 'berserk' is actually
almost never encountered in the earlier, pre-computer word counts
[corpora of a few hundred Kwords to about 18 Mwords], and nevertheless
acts phonologically in some ways like a 'medium'-frequency word--there is
something about its phonological shape [apparently] which makes it
extremely salient for English speakers).
	By the way, there is also some evidence in the article which
calls into question whether, in at least some cases, nonautomatic
morphophonemic alternation may produce distinct lexical entries, for at
least some effects (specifically, the first vowel in the verb 'mistake'
reduces, but the past tense 'mistook' usually has the first vowel
unreduced, since the two forms fall on opposite sides of the
'familiar/unfamiliar' frequency dividing line).  It is data like these
that make me interested in frequency counts of forms rather than
lexemes.

	The article reference is as follows:
Fidelholtz, James L. 1975. Word frequency and vowel reduction in
English. _Chicago linguistic society. Regional meeting. Papers_
11.200-213.
	At some point in the future, there will be an electronic version
of this article available on the Web, but I can't promise when.  I will
let you know when it is available.
	Jim

James L. Fidelholtz			e-mail: jfidel at siu.buap.mx
Maestría en Ciencias del Lenguaje
Instituto de Ciencias Sociales y Humanidades
Benemérita Universidad Autónoma de Puebla, MÉXICO
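
As an illustration of the dividing lines described in the message above, here
is a minimal sketch that converts a raw count to a per-million rate and places
it in one of three bands. The function names, the band labels, and the choice
of 40/M (one value picked from inside the 35-50/M range mentioned) are
assumptions made for this example, not values taken from the article.

```python
def per_million(count, corpus_size):
    """Normalise a raw count to occurrences per million tokens."""
    return count * 1_000_000 / corpus_size


def band(count, corpus_size, low_cut=4.0, high_cut=40.0):
    """Place a word form in a band by its per-million rate.

    low_cut follows the ~4/M dividing line; high_cut is an assumed value
    inside the 35-50/M range, for which the evidence is said to be weaker.
    """
    rate = per_million(count, corpus_size)
    if rate < low_cut:
        return "low"
    if rate < high_cut:
        return "medium"
    return "high"
```

Since the message argues for counting forms rather than lexemes, the counts
passed in would be those of individual word forms ('mistake' and 'mistook'
counted separately).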

-----------------------------
Tony Berber Sardinha:

Hi Pascual

Would there be something wrong with simply placing 20% of the tokens in each
frequency band and then 'adjusting' the individual percent freqs to fit within
the 20% intervals? I simulated this in the spreadsheet that's attached to this
message.

I ask because I have an interest in this issue as well and I'm positive my
approach is far too naive.

abraço
tony.
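
One possible reading of the suggestion above, as a minimal sketch: sort the
word types by frequency, walk down the list, and assign each type to the band
in which its share of the running token total begins. The attached spreadsheet
is not reproduced here, so the function name and the rule for handling a type
that straddles a 20% boundary are assumptions.

```python
def token_share_bands(counts, n_bands=5):
    """Assign each type to one of n_bands bands of equal token share.

    counts maps a word type to its token frequency. Band 1 holds the most
    frequent types. A type goes into the band in which its tokens start,
    which is an assumed way of 'adjusting' types to the interval edges.
    """
    total = sum(counts.values())
    bands = {}
    cumulative = 0
    # Most frequent first; ties broken alphabetically for a stable result.
    for word, freq in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        index = min(cumulative * n_bands // total, n_bands - 1)
        bands[word] = index + 1
        cumulative += freq
    return bands
```

Because a handful of very frequent types can cover more than 20% of the
tokens on their own, some bands may end up empty under this rule, which is
one reason the intervals may need adjusting.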

-----------------------------
Eric Atwell:

Pascual,
one point to remember is Zipf's law of frequency distribution 
of countable things in language.  You may need to use a logarithmic scale
in classifying into low/medium/high frequency.  For example, many years ago
I worked on the wordlist and suffixlist used in the LOB Corpus tagging program,
classifying word-tags with words and suffixes on a logarithmic scale:
POS-tags were classified common/rare/very-rare, where "rare" meant less
than 10%, "very rare" meant 1% or less,
e.g. water NN VB@  means "water" is usually Noun, about 10% Verb.

You need huge data samples to yield frequencies accurate enough to give 
more fine-grained distinctions - I would advise against as many as 5 levels
Very Low/Low/Moderate/High/Very High unless you are confident you can get
enough examples to classify with confidence.

Eric 

Eric Atwell, Distributed Multimedia Systems MSc Tutor & SOCRATES Coordinator
 Centre for Computer Analysis of Language And Speech (CCALAS)
 School of Computer Studies, Faculty of Engineering,
 University of Leeds, LEEDS LS2 9JT, England
 EMAIL: eric at scs.leeds.ac.uk  TEL: (44)113-2335430  FAX: (44)113-2335468
 WWW: http://www.scs.leeds.ac.uk/eric 
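
A minimal sketch of the two ideas in the message above: labelling each tag of
a word by its share of that word's occurrences (less than 10% is "rare", 1% or
less is "very rare"), and banding a frequency on a base-10 logarithmic scale.
The function names and the use of per-million rates for the logarithmic bands
are assumptions for this example; this is not the LOB tagging program's own
code.

```python
import math


def tag_rarity(tag_counts):
    """Label each tag of one word as common, rare, or very rare.

    tag_counts maps a POS-tag to the number of times the word carried it.
    Integer comparisons are used so the 10% and 1% edges are exact.
    """
    total = sum(tag_counts.values())
    labels = {}
    for tag, count in tag_counts.items():
        if count * 100 <= total:      # 1% or less
            labels[tag] = "very rare"
        elif count * 10 < total:      # less than 10%
            labels[tag] = "rare"
        else:
            labels[tag] = "common"
    return labels


def log_band(rate_per_million):
    """Return the order of magnitude of a positive per-million rate."""
    return math.floor(math.log10(rate_per_million))
```

With a logarithmic scale each band spans a factor of ten, so three or four
bands already cover a very wide frequency range.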

---------------------------

Ramesh Krishnamurthy:

Dear Dr Gomez
Cobuild used corpus lemma frequencies in their Dictionary (2nd edition,
1995). We devised a 5-band distinction, with 700 lemmas in the most
frequent band, 1200 in the 2nd band, 1500 in the 3rd band, 3200 in the 4th
and 8100 in the 5th. I can't remember the exact frequency cut-offs used,
but I'm confident that most users of the dictionary have found it a very
useful addition.
The exact cut-off points might be affected by the size of the corpus, and
may also be language dependent (in a highly inflected language like
Spanish, there might be different relationships between some types and
lemmas when compared to a relatively uninflected language like English).
Also the purpose of your classification may affect your decisions. For a
dictionary, lemma is presumably more important than type, although type
distribution within a lemma may influence whether a form is treated under
the main lemma form, or is given separate headword status (e.g. "situated"
in an English dictionary may be a separate headword, as well as being an
inflected form under the headword "situate"; similarly "painting" and
"paint"; word-class shifts would also have to be taken into account).
Hope this helps.
Ramesh

Ramesh Krishnamurthy
Honorary Research Fellow
Corpus Research Group
University of Birmingham
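
The band sizes given in the message above (700, 1200, 1500, 3200 and 8100
lemmas) can be turned into a rank-based lookup, sketched below. The message
says the bands were set by frequency cut-offs that are not recalled, so
assigning bands purely by position in a frequency-ranked lemma list is an
assumption of this example, as are the names used.

```python
from bisect import bisect_left
from itertools import accumulate

# Band sizes as reported in the message, most frequent band first.
BAND_SIZES = [700, 1200, 1500, 3200, 8100]
# Highest rank in each band: [700, 1900, 3400, 6600, 14700]
BAND_LIMITS = list(accumulate(BAND_SIZES))


def band_for_rank(rank):
    """Return the band (1-5) for a 1-based frequency rank, or None.

    None means the lemma falls below the fifth band and is left unbanded.
    """
    if rank < 1:
        raise ValueError("rank must be 1 or greater")
    index = bisect_left(BAND_LIMITS, rank)
    return index + 1 if index < len(BAND_LIMITS) else None
```

The five bands together cover 14,700 lemmas, and each band is larger than the
one before it, which fits the long tail of low-frequency items.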



___________________________________________________

Dr. Pascual Cantos Gomez

Departamento de Filologia Inglesa
Universidad de Murcia
C./ Santo Cristo, 1
30071 Murcia - SPAIN

Tel: 968 364365; +34 968 364365
Fax: 968 363185; +34 968 363185
E-mail: pcantos at fcu.um.es


