[Corpora-List] Keyness across Texts

Mon Jul 9 15:46:26 UTC 2007

Thanks for this, Przemek! I was getting ready to apologise to Duncan for
including the ebrary link in my posting and for all the off-thread messages
it triggered... 

Best... Ute

> -----Original Message-----
> From: owner-corpora at lists.uib.no 
> [mailto:owner-corpora at lists.uib.no] On Behalf Of Przemek Kaszubski
> Sent: Monday, July 09, 2007 5:28 PM
> To: Hunter, Duncan
> Subject: Re: [Corpora-List] Keyness across Texts
> 
> Hello,
> 
> Mike Scott's perspective of keywords is mostly textual, as he 
> explains, rather than corpus-oriented. The same goes for 
> 'aboutness', as I understand it. I think determining 
> 'aboutness' in a corpus will largely depend on how many texts 
> you have in your corpus, and how diversified you reckon them 
> to be. Based on some empirical fidgeting you may first need 
> to figure out the optimum frequency and text range thresholds 
> for your keywords. The former is easy to get - WordSmith 
> Keywords will show you the corpus frequency for each key 
> word; the latter - less so: you might need to generate 
> "consistency counts" for all the words in the main (and 
> perhaps also reference) corpora and then use a database 
> program to match the keyword and consistency lists into a 
> single spreadsheet/matrix. You can then filter a list of 
> keywords matching your threshold criteria and then, 
> empirically again, work out a custom formula for "enhanced" 
> keyness, e.g. by combining keyness values with text range by 
> dividing one by the other, etc.
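
The matching and filtering steps described above can be sketched in Python rather than a database program. This is only an illustration under stated assumptions: the function name, the input layout (plain dictionaries rather than WordSmith exports), the threshold values and the example figures are all hypothetical, and weighting keyness by the proportion of texts is just one arbitrary way of combining the two values.

```python
def enhanced_keyness(keywords, consistency, n_texts, min_freq=5, min_range=0.1):
    """Join keyness and text-range data, filter by thresholds, rank by a combined score.

    keywords:    {word: (keyness, corpus_frequency)}   (hypothetical layout)
    consistency: {word: number_of_texts_containing_word}
    """
    rows = []
    for word, (keyness, freq) in keywords.items():
        # Text range expressed as the proportion of texts containing the word.
        text_range = consistency.get(word, 0) / n_texts
        if freq < min_freq or text_range < min_range:
            continue  # below the frequency or range threshold
        # Assumed combination: keyness weighted by text range.
        rows.append((word, keyness, text_range, keyness * text_range))
    # Highest combined score first.
    return sorted(rows, key=lambda row: row[3], reverse=True)


# Invented figures, for illustration only.
keywords = {"corpus": (120.0, 40), "whale": (150.0, 30), "token": (60.0, 4)}
consistency = {"corpus": 8, "whale": 1, "token": 3}

ranked = enhanced_keyness(keywords, consistency, n_texts=10)
for word, keyness, text_range, score in ranked:
    print(word, keyness, text_range, round(score, 2))
```

In this toy data "whale" has the highest raw keyness but occurs in only one text out of ten, so it drops below "corpus" once range is taken into account; "token" is removed by the frequency threshold.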
> 
> Text range alone may be an insufficient adjustment, though. 
> Words can be spread over a decent % of texts and still show 
> extreme burstiness in a minority of texts - you will probably 
> want to know this. Therefore, it makes sense to work keyness, 
> range *and* a measure of dispersion like standard deviation 
> (or variance) of per-text frequency into your formula. Adam 
> Kilgarriff once used variance/mean for ranking some of the 
> BNC wordlists, which I often resort to as a useful measure of 
> a word's 'coreness' in a corpus.
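
A variance/mean ratio over per-text frequencies can be computed with the Python standard library. This is a minimal sketch, not a reconstruction of the procedure used for the BNC wordlists: the function name and the sample figures are invented, and it assumes the per-text frequencies are already comparable (for example, normalised per 1,000 words if the texts differ in length).

```python
from statistics import mean, pvariance


def variance_to_mean(per_text_freqs):
    """Return variance/mean of a word's per-text frequencies.

    Low values suggest an even spread across texts; high values suggest
    that the occurrences are concentrated in a few texts (burstiness).
    """
    m = mean(per_text_freqs)
    if m == 0:
        return 0.0  # the word never occurs; the ratio is undefined, report 0
    return pvariance(per_text_freqs) / m


# Two invented words with the same total frequency (25) over five texts.
steady = [5, 6, 5, 4, 5]
bursty = [0, 0, 25, 0, 0]

print(variance_to_mean(steady))
print(variance_to_mean(bursty))
```

Both words occur in the corpus equally often, but the second yields a far higher ratio because all of its occurrences fall in a single text.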
> 
> Amateurish as they are, I find that some of my keyness 
> formulae work nicely -- separating words with a high 
> comparative frequency AND stable recurrence in many files 
> from those which are also key but behave more erratically. I 
> tend to refer to the former as 'technical' as opposed to 
> 'topical' words, since I work with pretty homogeneous 
> collections. As I said, balancing the keyness index against 
> range/variance may be corpus-specific: sometimes you will 
> want to tweak your formula to favour keyness (frequency), 
> sometimes range/consistency will take slightly higher priority.
> 
> Just like you, I will be happy to read more about (relatively
> uncomplicated) statistical solutions that may have been in use.
> 
> Regards,
> 
> Przemek
> 
> 
> Hunter, Duncan wrote (2007-07-09 14:29):
> >
> > Hello Colleagues!
> >
> > A question about ‘key-ness’, and key words, in a group of texts
> >
> > I’ve been mulling over some ‘key-ness’ statistics for a selection of
> > texts I’ve been studying and a rather odd question has occurred to me.
> >
> > I’ve been attempting to discover something of the thematic content or
> > ‘about-ness’ of a group of texts by using a keywords analysis,
> > comparing the word frequency profile of the selection of texts with a
> > comparative group to derive ‘key-ness’ (via log-likelihood) stats for
> > each word.
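
For reference, a log-likelihood keyness score for a single word can be sketched as follows. This uses the common two-term form of the statistic computed from the word's frequency and the corpus size in each of the two corpora; the function name and the example figures are assumptions, and a given tool's implementation may differ in detail.

```python
import math


def log_likelihood(freq_study, size_study, freq_ref, size_ref):
    """Two-term log-likelihood (G2) for one word in a study vs. reference corpus."""
    total_size = size_study + size_ref
    total_freq = freq_study + freq_ref
    # Frequencies expected if the word were equally common in both corpora.
    expected_study = size_study * total_freq / total_size
    expected_ref = size_ref * total_freq / total_size
    ll = 0.0
    # A zero observed frequency contributes nothing (limit of x*log(x) at 0).
    if freq_study > 0:
        ll += freq_study * math.log(freq_study / expected_study)
    if freq_ref > 0:
        ll += freq_ref * math.log(freq_ref / expected_ref)
    return 2 * ll


# Invented figures: 50 hits in 10,000 words vs. 10 hits in 10,000 words.
print(round(log_likelihood(50, 10_000, 10, 10_000), 2))
# Identical relative frequency in both corpora gives a score of zero.
print(log_likelihood(10, 1_000, 10, 1_000))
```

Note that the score depends only on corpus-level totals, which is why it cannot by itself distinguish a word spread over many texts from one concentrated in a single text.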
> >
> > The key-ness value returned by such a procedure can be misleading 
> > because of the problem of dispersal; is the word ‘key’ because it 
> > occurs in a lot of text samples in the corpus or because of a very 
> > high usage in only a single text or small group of texts?
> >
> > It occurs to me: would it be possible to formulate some kind of
> > measure of a word’s ‘overall key-ness’ in the set of texts we are
> > studying? By multiplying the word’s key score by the number
> > of texts in which it is key, for example. Of course the resulting
> > figure in this case would be totally arbitrary in a sense; even in the
> > non-parametric realm of corpus comparison measurement it would not
> > really ‘mean’ anything beyond its own description...
> >
> > However it seems to me useful to have some kind of quantitative means
> > of describing a word’s significance across a range of texts in some
> > way. Any ideas? I am a relative 'newbie' in this field; surely this
> > issue has been tackled by somebody else somewhere?
> >
> > All the best,
> >
> > Duncan Hunter
> >
> 
> --
> Dr Przemyslaw Kaszubski
> +48 61 8293515
> 
> PICLE EAP LEARNER CORPUS ONLINE:
> http://www.staff.amu.edu.pl/~przemka/picle.html
> 
> CORPUS LINGUISTICS BIBLIOGRAPHY:
> http://www.staff.amu.edu.pl/~przemka
> 
> MY CORPUS LINGUISTICS SEMINARS:
> http://www.staff.amu.edu.pl/~przemka/seminars.htm
> 
> EAP WRITING PAGE (IFA FULL-TIME PROGRAMME):
> http://www.staff.amu.edu.pl/~przemka/IFA_writing
> 
> =======================================
> School of English (IFA)
> Adam Mickiewicz University
> http://ifa.amu.edu.pl
> =======================================
> 
> 
>