Corpora: Subjective familiarity and objective frequency counts

Adam Kilgarriff Adam.Kilgarriff at itri.brighton.ac.uk
Wed Sep 6 15:36:35 UTC 2000


Bruce,


 > Does anyone know of research examining the correlation between subjective
 > assessments of familiarity/frequency (i.e., how often do you see, hear,
 > read, write, speak this word?) and objective frequency counts (based on
 > large corpora)?

You tackle a big topic here.

You might say frequency is only of interest because it serves as a
proxy for salience (aka subjective assessments of familiarity) - which
cannot be straightforwardly measured.  So the critical question is:
where does corpus frequency fall down as a proxy for salience?
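
If you do have subjective ratings for a word list, one rough way to
see how far corpus frequency tracks them is a rank correlation.  The
sketch below is only an illustration: the counts are the BNC figures
quoted later in this message, but the familiarity ratings are invented
placeholders on an assumed 1-7 scale, not real data, and the function
names are my own.

```python
def ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the two rank lists."""
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5  # undefined if either list is constant


words = ["thumb", "government", "quick", "quickly"]
bnc_counts = [1363, 66894, 5920, 12381]   # figures quoted in this message
ratings = [6.8, 6.1, 6.9, 6.5]            # INVENTED placeholders, not real ratings

rho = spearman(bnc_counts, ratings)
print(round(rho, 2))  # -0.8 for these made-up ratings
```

With real ratings, the words that sit far apart in the two rankings are
the ones where frequency is a poor stand-in for familiarity.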

Just as there are severe limits to how far you can go with the notion
of a corpus being representative, so there are limits to 'salience' -
different words are salient to different degrees for different people,
and you really couldn't get professional cricketers and computer
programmers to agree on the relative salience of "stump" and
"interface". "Representative of what" translates to "salient for whom?"
The moral: don't take corpus frequencies too seriously.  Beyond the
first few thousand items, a small change in sampling policy will
produce quite different frequency lists.

One interesting proposal is that a corpus of children's language (or
language written for children) is a source of frequencies that will
better correspond to salience than a corpus of adult language.
Compare "thumb" (BNC count: 1,363) and "government" (BNC count: 66,894).
In any corpus aiming at anything like representativeness, "government"
will be far more frequent.  Arguably, "thumb" is more salient -
presenting as it does a clear, simple image, familiar to every member
of the language community from a very early age.  This relates to there
being more children's stories featuring thumbs than governments, and
to it being closer to a cognitive-psychology "basic level object", and
also to the order in which we learn words, and thereby how deep they
lie in our conceptualisation of the world.

Then there are snags like derivational morphology: "quick" (adj) has
BNC freq 5,920 whereas "quickly" has 12,381, but it's perverse to
argue that "quickly" is more salient. Indeed, what does salience
attach to: words, stems, or (in the other direction) word senses?
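
To make the "words or stems" choice concrete, here is a minimal sketch
of pooling counts by stem.  The word-to-stem mapping is written by hand
for illustration (an assumption, not the output of any particular
stemmer); the counts are the two BNC figures above.

```python
from collections import defaultdict

# BNC counts quoted in this message.
word_freq = {"quick": 5920, "quickly": 12381}

# Hypothetical hand-made mapping from word form to stem.
stem_of = {"quick": "quick", "quickly": "quick"}

# Sum the counts of all word forms that share a stem.
stem_freq = defaultdict(int)
for word, count in word_freq.items():
    stem_freq[stem_of[word]] += count

print(dict(stem_freq))  # {'quick': 18301}
```

Whether the pooled figure says anything more about salience than the
separate ones is exactly the open question.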

At Longman, we certainly thought long and hard about these issues
before deciding to publish frequency band info (in LDOCE 3, 1995) and
when deciding how to implement the ordering of senses: "most
frequent first" vs. "most salient first".

Psycholinguists argue they can measure salience with, e.g., time taken
in lexical decision tasks, and that this is closer to the
psychological truth than corpus frequencies.  But the data is
expensive to come by and still leaves lots of questions unanswered.
They do have lots of experience of experimental paradigms in this
territory (see many issues of Jnl of Psycholinguistics, work by
Tanenhaus and Seidenberg among many others).  I don't know of
published work outside that paradigm on the topic.

Depending, as ever, on corpus composition, raw frequency is often a
poorer proxy for salience than document frequency -- the number of docs
a word occurs in -- since document frequency curbs the worst excesses
of low-salience words occurring with high frequency because they are
used a lot in a single specialised document.  However, there are also
general patterns whereby verbs and prepositions are more evenly spread
through the language than nouns and pronouns, so counting doc
frequency will tend to push verbs and prepositions higher up the freq
list relative to nouns and pronouns - who's to say whether that's a
good thing or not!
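
The difference between the two counts can be sketched in a few lines.
This is a toy illustration: the three "documents" are invented, the
tokenisation is plain whitespace splitting, and the function name is
my own.

```python
from collections import Counter


def raw_and_document_frequency(docs):
    """Return (raw_freq, doc_freq) Counters for a list of tokenised documents."""
    raw_freq = Counter()
    doc_freq = Counter()
    for tokens in docs:
        raw_freq.update(tokens)       # every occurrence counts
        doc_freq.update(set(tokens))  # each document counts once per word
    return raw_freq, doc_freq


# Invented toy corpus: one specialised document repeats "stump".
docs = [
    "the umpire checked the stump then the stump fell and the stump broke".split(),
    "she walked to the shop".split(),
    "he walked to the station".split(),
]

raw_freq, doc_freq = raw_and_document_frequency(docs)
print(raw_freq["stump"], doc_freq["stump"])    # 3 1
print(raw_freq["walked"], doc_freq["walked"])  # 2 2
```

By raw count "stump" beats "walked"; by document count the order flips,
which is the curbing effect described above.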

 >
 > I know of one such paper by the psycholinguist Paul Luce (I will provide
 > the reference when I can find it again). Any other pointers or comments on
 > this issue are welcome.
 >
 > My specific interest is in the relationship between the prescribing
 > frequency of specific drugs (drug names) and health professionals'
 > subjective familiarity with those same names. This kind of information is
 > very important in  psycholinguistic research, where the effects of word
 > frequency can be quite overpowering.
 >
 > On a related note, I'd appreciate pointers to any corpus of
 > medical/pharmaceutical/nursing literature that might serve as the basis for
 > an empirical count of drug names in the professional or scientific press.


Not sure this would be worth much: what would be the population from
which you would ideally draw your sample, and can you get anything
like that in reality?  Minor differences in sampling will produce
drastically different frequencies of drug names, simply depending on
which specialisations get represented in your corpus.

Adam

--
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Adam Kilgarriff
Senior Research Fellow                         tel: (44) 1273 642919
Information Technology Research Institute           (44) 1273 642900
University of Brighton                         fax: (44) 1273 642908
Lewes Road
Brighton BN2 4GJ         email:      Adam.Kilgarriff at itri.bton.ac.uk
UK                       http://www.itri.bton.ac.uk/~Adam.Kilgarriff
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%


