Corpora: Frequency Meaning

Sun Feb 20 04:00:29 UTC 2000

On Thu, 17 Feb 2000, Pascual Cantos wrote:

>I was recently wondering about the usefulness of using frequency data in
>order to classify types/lemmas in various frequency layers, say:
>
>	- Very Low
>	- Low
>	- Moderate
>	- High
>	- Very High
>
>What criteria would you suggest to carry this out?

Dear Pascual:
	I don't have any very recent info for you, but I did publish an
article in 1975 on English vowel reduction, which contains some
suggestive data for part of your question (at least for English,
although I would have to be convinced that frequency phenomena are
significantly different in this regard for different languages).  There
is a pretty clear dividing line at about 4 occurrences per million words
(4/M, plus or minus about 3/M) between words whose vowels reduce in
certain environments and words whose vowels stay unreduced in those
environments (of course, the more frequent words show the greater
tendency toward reduction).  It seems to me that this would probably
correspond to the difference between 'moderate' and 'low', but a lot
depends on how you define these categories.  For this dividing line the
evidence is overwhelmingly strong, in my opinion.  There is some fairly
weak evidence (from other environments with relatively few examples) for
another dividing line somewhere around 35-50/M, which might correspond
to the 'moderate'/'high' division, although I am less confident about
various aspects of this second cut-off.
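
As a rough illustration of the two dividing lines just described, the
following Python sketch converts raw counts to occurrences per million
words and assigns a band.  The function names and the choice of 35/M
(the lower end of the tentative 35-50/M range) are assumptions made for
the example, not values fixed by the article, and the three bands are a
simplification of the five proposed in the question.

```python
# Illustrative cut-offs in occurrences per million words (assumptions).
LOW_MODERATE_PER_M = 4.0    # the "about 4/M" dividing line
MODERATE_HIGH_PER_M = 35.0  # lower end of the tentative 35-50/M range


def per_million(count, corpus_size):
    """Convert a raw count to occurrences per million running words."""
    return count * 1_000_000 / corpus_size


def band(freq_per_m):
    """Assign a frequency band from a per-million rate."""
    if freq_per_m < LOW_MODERATE_PER_M:
        return "low"
    if freq_per_m < MODERATE_HIGH_PER_M:
        return "moderate"
    return "high"


def classify_counts(counts, corpus_size):
    """Map each word in a {word: raw count} table to its band."""
    return {w: band(per_million(c, corpus_size)) for w, c in counts.items()}
```

Given the stated margin of plus or minus about 3/M, words falling near
the lower cut-off would be better treated as borderline than as firmly
in one band.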
	No doubt others will have different ideas on what these
differences correspond to, based on totally different analyzed data, but
maybe we can get at some consensus about what these categories (or a
smaller number of categories, perhaps) might correspond to
psychologically.  This last word is important, as there seem to be
various factors which may make a relatively infrequent word
psychologically more salient, or vice versa (e.g., 'berserk' is
almost never encountered in the earlier, pre-computer word counts
[corpora of a few hundred Kwords to about 18 Mwords], and nevertheless
acts phonologically in some ways like a 'moderate'-frequency word; there
is apparently something about its phonological shape which makes it
extremely salient for English speakers).
	By the way, there is also some evidence in the article which
raises the question of whether, in at least some cases, nonautomatic
morphophonemic alternation may produce distinct lexical entries, for at
least some effects (specifically, the first vowel in the verb 'mistake'
reduces, but the past tense 'mistook' usually has the first vowel
unreduced, since the two forms fall on opposite sides of the
'familiar/unfamiliar' frequency dividing line).  It is data like these
that make me interested in frequency counts of forms rather than
lexemes.
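
To make the form/lexeme point concrete, here is a small Python sketch
comparing per-form and per-lexeme rates.  The counts, the corpus size,
the form-to-lexeme table and the 4/M cut-off used as the
'familiar/unfamiliar' line are all invented for illustration; they are
not figures from the article.

```python
from collections import Counter

FAMILIAR_PER_M = 4.0  # assumed cut-off, borrowed from the "about 4/M" figure

# Hypothetical form -> lexeme table for this example only.
LEXEME_OF = {"mistake": "mistake", "mistook": "mistake"}


def rates_per_million(counts, corpus_size):
    """Convert {item: raw count} to {item: occurrences per million}."""
    return {k: c * 1_000_000 / corpus_size for k, c in counts.items()}


def lexeme_counts(form_counts, lexeme_of):
    """Merge per-form counts into per-lexeme counts."""
    merged = Counter()
    for form, c in form_counts.items():
        merged[lexeme_of.get(form, form)] += c
    return merged


# Invented counts in a hypothetical corpus of 1,000,000 running words.
CORPUS_SIZE = 1_000_000
form_counts = {"mistake": 12, "mistook": 1}

by_form = rates_per_million(form_counts, CORPUS_SIZE)
by_lexeme = rates_per_million(lexeme_counts(form_counts, LEXEME_OF), CORPUS_SIZE)

for form, rate in by_form.items():
    side = "familiar" if rate >= FAMILIAR_PER_M else "unfamiliar"
    print(form, rate, side)
print("lexeme", by_lexeme)
```

With these made-up numbers the two forms land on opposite sides of the
cut-off, while the merged lexeme count shows only a single 'familiar'
entry, which is the kind of information a lexeme-only count would hide.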

	The article reference is as follows:
Fidelholtz, James L. 1975. Word frequency and vowel reduction in
English. _Chicago linguistic society. Regional meeting. Papers_
11.200-213.
	At some point in the future, there will be an electronic version
of this article available on the Web, but I can't promise when.  I will
let you know when it is available.
	Jim

James L. Fidelholtz			e-mail: jfidel at siu.buap.mx
Maestría en Ciencias del Lenguaje
Instituto de Ciencias Sociales y Humanidades
Benemérita Universidad Autónoma de Puebla, MÉXICO