[Corpora-List] Keyness across Texts

Przemek Kaszubski przemka at amu.edu.pl
Mon Jul 9 15:28:25 UTC 2007


Hello,

Mike Scott's perspective on keywords is mostly textual, as he explains, 
rather than corpus-oriented. The same goes for 'aboutness', as I 
understand it. I think determining 'aboutness' in a corpus will largely 
depend on how many texts you have in your corpus, and how diversified 
you reckon them to be. Based on some empirical fidgeting you may first 
need to figure out the optimum frequency and text range thresholds for 
your keywords. The former is easy to get - WordSmith Keywords will show 
you the corpus frequency for each key word; the latter - less so: you 
might need to generate "consistency counts" for all the words in the 
main (and perhaps also reference) corpora and then use a database 
program to match the keyword and consistency lists into a single 
spreadsheet/matrix. You can then filter a list of keywords matching your 
threshold criteria and then, empirically again, work out a custom 
formula for "enhanced" keyness, e.g. by combining keyness values with 
text range by dividing one by the other, etc.
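The matching and filtering step can also be sketched in a few lines of Python instead of a database program. This is only an illustration under assumptions: the two dictionaries stand in for an exported keyword list and a consistency list, the words, values and thresholds are made up, and multiplying keyness by the proportion of texts containing the word is just one possible way of combining the two.

```python
# Hypothetical keyword list: word -> (keyness, corpus frequency).
keywords = {
    "corpus": (250.0, 120),
    "keyness": (180.0, 40),
    "zebra": (300.0, 90),
}
# Hypothetical consistency counts: word -> number of texts containing it.
consistency = {"corpus": 18, "keyness": 10, "zebra": 2, "the": 20}
TOTAL_TEXTS = 20


def enhanced_keyness(keywords, consistency, total_texts,
                     min_freq=30, min_range=0.25):
    """Match the two lists, apply both thresholds, rank by keyness * range."""
    rows = []
    for word, (keyness, freq) in keywords.items():
        # Text range = share of texts in which the word occurs at all.
        text_range = consistency.get(word, 0) / total_texts
        if freq < min_freq or text_range < min_range:
            continue  # fails the frequency or the range threshold
        rows.append((word, keyness, freq, text_range, keyness * text_range))
    # Highest combined score first.
    rows.sort(key=lambda row: row[4], reverse=True)
    return rows


ranked = enhanced_keyness(keywords, consistency, TOTAL_TEXTS)
for word, keyness, freq, text_range, score in ranked:
    print(f"{word:10s} keyness={keyness:6.1f} range={text_range:.2f} "
          f"score={score:.1f}")
```

In this toy data "zebra" has the highest raw keyness but occurs in only 2 of 20 texts, so it is dropped by the range threshold.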

Text range alone may be an insufficient adjustment, though. Words can be 
spread over a decent % of texts and still show extreme burstiness in a 
minority of texts - you will probably want to know this. Therefore, it 
makes sense to work keyness, range *and* a measure of dispersion like 
standard deviation (or variance) of per-text frequency into your 
formula. Adam Kilgarriff once used variance/mean for ranking some of the 
BNC wordlists, which I often resort to as a useful measure of a word's 
'coreness' in a corpus.
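A rough sketch of the variance/mean idea in Python. The function name and the counts are made up, and I am assuming population variance over raw per-text counts; with texts of very different lengths you would probably want normalised frequencies instead.

```python
from statistics import mean, pvariance


def variance_to_mean(per_text_counts):
    """Variance/mean of a word's per-text frequencies (low = evenly spread)."""
    avg = mean(per_text_counts)
    if avg == 0:
        return 0.0  # the word never occurs, so there is nothing to measure
    return pvariance(per_text_counts) / avg


# Two hypothetical words, both with 20 occurrences over four texts.
even = variance_to_mean([5, 5, 5, 5])     # same count in every text
bursty = variance_to_mean([20, 0, 0, 0])  # all occurrences in one text
print(even, bursty)
```

Both words have the same corpus frequency, but the evenly spread one scores 0 and the bursty one 15, so the ratio separates them where frequency alone cannot.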

Amateurish as they are, I find that some of my keyness formulae work 
nicely -- separating words with a high comparative frequency AND stable 
recurrence in many files from those which are also key but behave more 
erratically. I tend to refer to the former as 'technical' as opposed to 
'topical' words, since I work with pretty homogeneous collections. As I 
said, balancing the keyness index against range/variance may be 
corpus-specific: sometimes you will want to tweak your formula to favour 
keyness (frequency), sometimes your slightly higher priority will be 
range/consistency.
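To illustrate what such tweaking might look like, here is a purely hypothetical formula in Python (not one of my actual formulae): keyness is multiplied by text range and divided by 1 + variance/mean, and the three exponents are invented weights that let you favour one component over the others.

```python
from statistics import mean, pvariance


def combined_score(keyness, per_text_counts,
                   w_key=1.0, w_range=1.0, w_disp=1.0):
    """Hypothetical score: keyness^a * range^b / (1 + variance/mean)^c."""
    n_texts = len(per_text_counts)
    # Share of texts in which the word occurs at least once.
    text_range = sum(1 for c in per_text_counts if c > 0) / n_texts
    avg = mean(per_text_counts)
    vmr = pvariance(per_text_counts) / avg if avg > 0 else 0.0
    return (keyness ** w_key) * (text_range ** w_range) / ((1.0 + vmr) ** w_disp)


# Same keyness value, different behaviour across four texts (made-up counts).
steady = combined_score(200.0, [5, 6, 4, 5])   # 'technical'-like word
bursty = combined_score(200.0, [18, 1, 1, 0])  # 'topical'-like word
print(round(steady, 1), round(bursty, 1))
```

Setting w_range and w_disp to 0 reduces the score to plain keyness; raising them pushes erratic words further down the list.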

Just like you, I will be happy to read more about (relatively 
uncomplicated) statistical solutions that may have been in use.

Regards,

Przemek


Hunter, Duncan wrote (2007-07-09 14:29):
>
> Hello Colleagues!
>
> A question about ‘key-ness’, and key words, in a group of texts…
>
> I’ve been mulling over some ‘key-ness’ statistics for a selection of 
> texts I’ve been studying and a rather odd question has occurred to me….
>
> I’ve been attempting to discover something of the thematic content or 
> ‘about-ness’ of a group of texts by using a keywords analysis, 
> comparing the word frequency profile of the selection of texts with a 
> comparative group to derive ‘key-ness’ (via log-likelihood) stats for 
> each word.
>
> The key-ness value returned by such a procedure can be misleading 
> because of the problem of dispersal; is the word ‘key’ because it 
> occurs in a lot of text samples in the corpus or because of a very 
> high usage in only a single text or small group of texts?
>
> It occurs to me; would it be possible to formulate some kind of 
> measure of a word’s ‘overall key-ness’ in the set of texts we are 
> studying? By multiplying together the word’s key score by the number 
> of texts in which it is key, for example. Of course the resulting 
> figure in this case would be totally arbitrary in a sense - even in the 
> non-parametric realm of corpus comparison measurement it would not 
> really ‘mean’ anything beyond its own description...
>
> However it seems to me useful to have some kind of quantitative means 
> of describing a word’s significance across a range of texts in some 
> way…Any ideas? I am a relative 'newbie' in this field, surely this 
> issue has been tackled by somebody else somewhere? !
>
> All the best,
>
> Duncan Hunter
>

-- 
Dr Przemyslaw Kaszubski
+48 61 8293515

PICLE EAP LEARNER CORPUS ONLINE:
http://www.staff.amu.edu.pl/~przemka/picle.html

CORPUS LINGUISTICS BIBLIOGRAPHY:
http://www.staff.amu.edu.pl/~przemka

MY CORPUS LINGUISTICS SEMINARS:
http://www.staff.amu.edu.pl/~przemka/seminars.htm

EAP WRITING PAGE (IFA FULL-TIME PROGRAMME):
http://www.staff.amu.edu.pl/~przemka/IFA_writing

=======================================
School of English (IFA)
Adam Mickiewicz University
http://ifa.amu.edu.pl
=======================================


