[Corpora-List] Keyness across Texts

Mon Jul 9 15:49:29 UTC 2007

Dear Duncan

NLP researchers prefer a statistic based on the "document-frequency" of 
a term as opposed to its "corpus frequency". When I originally built a 
keywords procedure for WordSmith, however, I used "corpus frequency". 
(If we take a hypothetical example of a text about elephants, the idea 
is to compare the frequency of the term elephant in that text not with 
the number of documents in the corpus which contain that term, whether 
once or more often, but with the total accumulated frequency of that 
term in the reference corpus.)
Since WordSmith 4, however, there has been the possibility of knowing 
each key-word's document frequency (the header column of the word-list 
from which it is derived calls this "Texts"), so I could add an option 
for users a) to see this for each KW and b) to sort on it.
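
To make the two counts concrete, here is a minimal Python sketch. It is not WordSmith's code: the function name and the three-text toy reference corpus are invented for illustration, and tokenisation is assumed to be a plain whitespace split.

```python
from collections import Counter


def corpus_and_document_frequency(texts):
    """Return (corpus_freq, doc_freq) for a list of tokenised texts.

    corpus_freq[term] = total occurrences across all texts
    doc_freq[term]    = number of texts containing the term at least once
    """
    corpus_freq = Counter()
    doc_freq = Counter()
    for tokens in texts:
        corpus_freq.update(tokens)     # every occurrence counts
        doc_freq.update(set(tokens))   # each text counts once per term
    return corpus_freq, doc_freq


# Toy reference corpus (invented for illustration).
reference = [
    "the elephant saw another elephant".split(),
    "the zoo was closed".split(),
    "an elephant never forgets".split(),
]

cf, df = corpus_and_document_frequency(reference)
print("corpus frequency of 'elephant':  ", cf["elephant"])
print("document frequency of 'elephant':", df["elephant"])
```

In this toy corpus "elephant" occurs three times in total but in only two of the three texts, so the two counts differ as soon as a term is repeated within a text.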

I doubt whether the current keyness multiplied by the Texts column 
("consistency" as I otherwise call it, and Nation calls it "range") 
would be useful though; I would think it better to consider keyness as a 
feature of the term in that sub-corpus or single text, with the chance 
to filter or re-sort according to consistency. For example, as you know, I 
find IT and DO to be key in certain Shakespeare texts. They are both 
extremely consistent terms. The keyness as a number is not a very good 
indicator since terms which are rare in the language come out more key 
than those which are more frequent. I regard it more like a threshold: 
if a term gets over it, it's key. Then we can secondarily sort, e.g. 
alphabetically, by consistency, by frequency in the sub-corpus or text, etc.
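
The threshold-then-sort idea can be sketched roughly as follows. This is an illustration under stated assumptions, not WordSmith's procedure: it uses the common two-term form of the log-likelihood statistic, takes 6.63 (approximately the p < 0.01 critical value of chi-square with one degree of freedom) as an arbitrary cut-off, and every count in the example is invented. The names log_likelihood and key_words are hypothetical.

```python
import math


def log_likelihood(freq_study, size_study, freq_ref, size_ref):
    """Two-term log-likelihood for one word in a study vs. reference corpus."""
    total = freq_study + freq_ref
    expected_study = size_study * total / (size_study + size_ref)
    expected_ref = size_ref * total / (size_study + size_ref)
    ll = 0.0
    if freq_study > 0:
        ll += freq_study * math.log(freq_study / expected_study)
    if freq_ref > 0:
        ll += freq_ref * math.log(freq_ref / expected_ref)
    return 2 * ll


def key_words(candidates, size_study, size_ref, threshold=6.63):
    """Keep words whose keyness clears the threshold, then sort by consistency.

    candidates maps word -> (freq_study, freq_ref, texts), where texts is
    the number of texts containing the word (the "Texts" column).
    """
    kept = []
    for word, (freq_study, freq_ref, texts) in candidates.items():
        # Positive keyness only: relatively more frequent in the study corpus.
        if freq_study / size_study <= freq_ref / size_ref:
            continue
        score = log_likelihood(freq_study, size_study, freq_ref, size_ref)
        if score >= threshold:
            kept.append((word, freq_study, texts))
    # The keyness score is used only as a gate; ordering is by consistency
    # (descending), with ties broken alphabetically.
    return sorted(kept, key=lambda item: (-item[2], item[0]))


# Invented counts: word -> (freq in study corpus, freq in reference, texts).
candidates = {
    "it": (300, 10000, 37),
    "elephant": (40, 50, 2),
    "the": (600, 60000, 37),
}
result = key_words(candidates, size_study=10_000, size_ref=1_000_000)
for word, freq, texts in result:
    print(word, freq, texts)
```

With these invented figures "elephant" would have by far the higher keyness score, being rare in the reference corpus, yet sorting by consistency after the threshold puts "it" first; "the" is dropped because its relative frequency is the same in both corpora.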

Cheers -- Mike

Hunter, Duncan wrote:
>
> Hello Colleagues!
>
> A question about 'key-ness', and key words, in a group of texts...
>
> I've been mulling over some 'key-ness' statistics for a selection of 
> texts I've been studying and a rather odd question has occurred to me....
>
> I've been attempting to discover something of the thematic content or 
> 'about-ness' of a group of texts by using a keywords analysis, 
> comparing the word frequency profile of the selection of texts with a 
> comparative group to derive 'key-ness' (via log-likelihood) stats for 
> each word.
>
> The key-ness value returned by such a procedure can be misleading 
> because of the problem of dispersal; is the word 'key' because it 
> occurs in a lot of text samples in the corpus or because of a very 
> high usage in only a single text or small group of texts?
>
> It occurs to me: would it be possible to formulate some kind of 
> measure of a word's 'overall key-ness' in the set of texts we are 
> studying? For example, by multiplying the word's key score by the 
> number of texts in which it is key. Of course the resulting figure in 
> this case would be totally arbitrary in a sense - even in the 
> non-parametric realm of corpus comparison measurement it would not 
> really 'mean' anything beyond its own description...
>
> However, it seems to me useful to have some kind of quantitative means 
> of describing a word's significance across a range of texts in some 
> way... Any ideas? I am a relative 'newbie' in this field; surely this 
> issue has been tackled by somebody else somewhere?!
>
> All the best,
>
> Duncan Hunter
>

-- 
Mike Scott

***
If you publish research which uses WordSmith, do let me know so I can include it at
http://www.lexically.net/wordsmith/corpus_linguistics_links/papers_using_wordsmith.htm
***
School of English
University of Liverpool
Liverpool L69 3BX, UK.
www.lexically.net
www.liv.ac.uk/~ms2928
