[Corpora-List] Keyness across Texts
Przemek Kaszubski
przemka at amu.edu.pl
Mon Jul 9 15:28:25 UTC 2007
Hello,
Mike Scott's perspective of keywords is mostly textual, as he explains,
rather than corpus-oriented. The same goes for 'aboutness', as I
understand it. I think determining 'aboutness' in a corpus will largely
depend on how many texts you have in your corpus, and how diversified
you reckon them to be. Based on some empirical fidgeting you may first
need to figure out the optimum frequency and text range thresholds for
your keywords. The former is easy to get - WordSmith Keywords will show
you the corpus frequency for each key word; the latter - less so: you
might need to generate "consistency counts" for all the words in the
main (and perhaps also reference) corpora and then use a database
program to match the keyword and consistency lists into a single
spreadsheet/matrix. You can then filter a list of keywords matching your
threshold criteria and then, empirically again, work out a custom
formula for "enhanced" keyness, e.g. by combining keyness values with
text range by dividing one by the other, etc.
Text range alone may be an insufficient adjustment, though. Words can be
spread over a decent % of texts and still show extreme burstiness in a
minority of texts - you will probably want to know this. Therefore, it
makes sense to work keyness, range *and* a measure of dispersion like
standard deviation (or variance) of per-text frequency into your
formula. Adam Kilgarriff once used variance/mean for ranking some of the
BNC wordlists, which I often resort to as a useful measure of a word's
'coreness' in a corpus.
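
To show what a variance/mean figure looks like, here is a small Python
sketch. It assumes the per-text frequencies are comparable (texts of
similar length, or counts already normalised per N words), uses the
population variance, and is not necessarily the exact computation
Kilgarriff used. The function name and the sample counts are invented.

```python
from statistics import mean, pvariance


def variance_to_mean(per_text_freqs):
    """Variance of the per-text frequencies divided by their mean.
    Low values suggest an even spread, high values suggest burstiness."""
    m = mean(per_text_freqs)
    if m == 0:
        return 0.0  # word never occurs; nothing to measure
    return pvariance(per_text_freqs) / m


# Two invented words with the same total frequency (30) in six texts.
steady = [5.0, 6.0, 5.0, 4.0, 5.0, 5.0]   # present in every text
bursty = [0.0, 0.0, 30.0, 0.0, 0.0, 0.0]  # concentrated in one text

print(variance_to_mean(steady))
print(variance_to_mean(bursty))
```

Both words have the same corpus frequency, but the second one gets a
much higher ratio, which is the kind of behaviour text range and raw
frequency can miss.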
Amateurish as they are, I find that some of my keyness formulae work
nicely -- separating words with a high comparative frequency AND stable
recurrence in many files from those which are also key but behave more
erratically. I tend to refer to the former as 'technical' as opposed to
'topical' words, since I work with pretty homogeneous collections. As I
said, balancing the keyness index against range/variance may be
corpus-specific: sometimes you will want to tweak your formula to favour
keyness (frequency), sometimes your slightly higher priority will be
range/consistency.
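
One way to make that balance adjustable is a single weight parameter.
The sketch below is purely hypothetical (the function name, the
geometric blend and the sample values are my own assumptions, not one of
the formulae mentioned above); it only shows how shifting the weight can
change the ranking.

```python
def weighted_score(keyness, range_fraction, weight=0.5):
    """Geometric blend of keyness and text range (share of texts, 0-1).
    weight=1.0 uses keyness only; weight=0.0 uses range only."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be between 0 and 1")
    return (keyness ** weight) * (range_fraction ** (1.0 - weight))


# Invented values: word A is very key but narrow, word B is less key
# but found in most texts.
word_a = (400.0, 0.2)
word_b = (100.0, 0.9)

for w in (0.8, 0.2):
    a = weighted_score(*word_a, weight=w)
    b = weighted_score(*word_b, weight=w)
    print(f"weight={w}: A={a:.2f} B={b:.2f}")
```

With the weight set towards keyness, A ranks above B; with the weight
set towards range, the order is reversed.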
Just like you, I will be happy to read more about (relatively
uncomplicated) statistical solutions that may have been in use.
Regards,
Przemek
Hunter, Duncan wrote (2007-07-09 14:29):
>
> Hello Colleagues!
>
> A question about ‘key-ness’, and key words, in a group of texts…
>
> I’ve been mulling over some ‘key-ness’ statistics for a selection of
> texts I’ve been studying and a rather odd question has occurred to me….
>
> I’ve been attempting to discover something of the thematic content or
> ‘about-ness’ of a group of texts by using a keywords analysis,
> comparing the word frequency profile of the selection of texts with a
> comparative group to derive ‘key-ness’ (via log-likelihood) stats for
> each word.
>
> The key-ness value returned by such a procedure can be misleading
> because of the problem of dispersal; is the word ‘key’ because it
> occurs in a lot of text samples in the corpus or because of a very
> high usage in only a single text or small group of texts?
>
> It occurs to me; would it be possible to formulate some kind of
> measure of a word’s ‘overall key-ness’ in the set of texts we are
> studying? By multiplying together the word’s key score by the number
> of texts in which it is key, for example. Of course the resulting
> figure in this case would be totally arbitrary in a sense - even in the
> non-parametric realm of corpus comparison measurement it would not
> really ‘mean’ anything beyond its own description...
>
> However it seems to me useful to have some kind of quantitative means
> of describing a word’s significance across a range of texts in some
> way…Any ideas? I am a relative 'newbie' in this field, surely this
> issue has been tackled by somebody else somewhere? !
>
> All the best,
>
> Duncan Hunter
>
--
Dr Przemyslaw Kaszubski
+48 61 8293515
PICLE EAP LEARNER CORPUS ONLINE:
http://www.staff.amu.edu.pl/~przemka/picle.html
CORPUS LINGUISTICS BIBLIOGRAPHY:
http://www.staff.amu.edu.pl/~przemka
MY CORPUS LINGUISTICS SEMINARS:
http://www.staff.amu.edu.pl/~przemka/seminars.htm
EAP WRITING PAGE (IFA FULL-TIME PROGRAMME):
http://www.staff.amu.edu.pl/~przemka/IFA_writing
=======================================
School of English (IFA)
Adam Mickiewicz University
http://ifa.amu.edu.pl
=======================================