[Corpora-List] Re: rare words

FIDELHOLTZ DOOCHIN JAMES LAWRENCE jfidel at siu.buap.mx
Wed Jun 18 16:12:54 UTC 2003


N M Chipere wrote:

> Is anyone familiar with the issues surrounding the definition and
> measurement of word rarity? My colleagues and I are currently treating
> the first two thousand most frequent words in English as common words and
> the rest as rare (excluding proper nouns, numerals, etc). Apart from the
> issue of where one puts the cut-off point, there is an obvious problem to
> do with homographs, for which we don't have a simple solution.
> Ngoni
>
> *********************************************************************
> Dr Ngoni Chipere
> Institute of Education
> The University of Reading
> Reading
> Berkshire RG6 1HY
>
> tel: 0118 987 5123 x 4943
> **********************************************************************

Hi Ngoni,

Well, what counts as 'rare' depends on what you are doing with it, on your
perspective, or both, among other things.  For example, in a 1975 article
('Word frequency and vowel reduction in English', Chicago linguistic
society. Annual meeting. Papers 11.200-213), I found that in a certain
environment (first syllable, before consonant clusters not beginning with a
nasal or whose second member is a liquid), reduction of unstressed lax
vowels occurred in 'frequent' words, where 'frequent' is defined as
occurring more than about 5 times per million words (I think that covers
rather more than the 2000 most frequent words; I used Thorndike & Lorge for
frequency counts).
In the same environment, but before clusters with an initial nasal
consonant, the same thing happens, but 'frequent' for this environment is
much higher, probably well over 50/M, which probably corresponds to fewer
than the first 1000 most frequent words (I haven't checked out the
correspondences exactly between 'most frequent' and 'N per million').  In
other cases (eg unstressed vowels before clusters between stressed
syllables), reduction is much easier, and even general except for some
homonymy issues in relatively rare words, eg 'ex_or_cize' (usually no
reduction before the movie The Exorcist came out) vs. 'ex_er_cize' (always
reduced).
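
To make the difference between a rank cut-off ('the first N words') and a
rate cut-off ('N per million') concrete, here is a minimal sketch.  It assumes
the corpus is just a list of tokens; the function names and the toy corpus are
made up for illustration and have nothing to do with the Thorndike & Lorge
counts.

```python
from collections import Counter

def per_million(counts):
    """Convert raw counts to occurrences per million tokens."""
    total = sum(counts.values())
    return {w: c * 1_000_000 / total for w, c in counts.items()}

def rare_by_rank(counts, top_n):
    """Words that fall outside the top_n most frequent (rank cut-off)."""
    common = {w for w, _ in counts.most_common(top_n)}
    return set(counts) - common

def rare_by_rate(counts, threshold_per_m):
    """Words occurring fewer than threshold_per_m times per million tokens."""
    rates = per_million(counts)
    return {w for w, r in rates.items() if r < threshold_per_m}

def rank_at_rate(counts, threshold_per_m):
    """How many words reach the rate threshold, i.e. the rank it corresponds to."""
    rates = per_million(counts)
    return sum(1 for r in rates.values() if r >= threshold_per_m)

# Toy corpus (hypothetical); a real study would use a large corpus or a
# published frequency list.
tokens = "the cat saw the dog and the dog saw the bird".split()
counts = Counter(tokens)

print(rare_by_rank(counts, 1))        # everything except the top word
print(rare_by_rate(counts, 100_000))  # words below the rate threshold
print(rank_at_rate(counts, 100_000))  # rank equivalent of that threshold
```

With a real frequency list, rank_at_rate is the quick way to check how a
threshold such as 5/M or 50/M lines up with 'the first 2000 words' in that
particular corpus.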

Homographs for the first few thousand most frequent words can be roughly
checked for the frequency of their 'parts' by checking a dittoed work by
Lorge & Thorndike (Lorge, Irving and Edward L. Thorndike.  1938.  A
semantic count of English words.  NY: The Institute of Educational
Research, Teachers College, Columbia University).  This had a run of about
100 copies and can be found in major libraries (I believe the British
Library has one).  A derivative work from
L&T is: West, Michael P. (compiler & ed.) 1953.  A general service list of
English words with semantic frequencies.  NY: Longmans, Green & Co.
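
As a rough sketch of how a semantic count can be applied to the homograph
problem: if you have the overall frequency of a written form and the share of
occurrences belonging to each sense, you can split the frequency and apply
your rarity threshold per sense.  The function names, the sense labels and
the numbers below are all invented for illustration; they are not taken from
Lorge & Thorndike or from West.

```python
def split_by_sense(word_rate_per_m, sense_shares):
    """Divide a word form's per-million rate among its senses.

    sense_shares maps each sense label to its share of occurrences
    (any positive numbers; they are normalised here).
    """
    total = sum(sense_shares.values())
    return {s: word_rate_per_m * share / total
            for s, share in sense_shares.items()}

def rare_senses(word_rate_per_m, sense_shares, threshold_per_m):
    """Senses whose own rate falls below the per-million threshold."""
    rates = split_by_sense(word_rate_per_m, sense_shares)
    return {s for s, r in rates.items() if r < threshold_per_m}

# Hypothetical figures: the form 'bank' at 120 per million, with 80% of
# occurrences in one sense and 20% in another.
shares = {"financial": 80, "river": 20}
print(split_by_sense(120, shares))
print(rare_senses(120, shares, 50))
```

The point is only that a form can be 'common' as a string while one of its
senses is 'rare' under the same threshold.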

By the way, these frequencies in T&L and L&T are obviously close to 70 years
old.  I don't think that matters much, since such relative frequencies A)
change very slowly, as far as I can tell; and B) are pretty heavily
corpus-dependent, anyway.  Still, there are much more recent frequency
counts around if that worries you.

Well, I hope this is some help.  All in all, not an easy problem, and very
dependent on your aims.

Jim

James L. Fidelholtz
Posgrado en Ciencias del Lenguaje
Benemérita Universidad Autónoma de Puebla     MÉXICO


