<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<html>

<head>

  <meta content="text/html;charset=ISO-8859-1" http-equiv="Content-Type">

</head>

<body bgcolor="#ffffff" text="#000000">

Dear Duncan<br>

<br>

NLP researchers prefer a statistic based on the "document-frequency" of

a term as opposed to its "corpus frequency". When I originally built a

keywords procedure for WordSmith, however, I used "corpus frequency".

(To take a hypothetical example of a text about elephants: the idea
is to compare the frequency of the term elephant in that text not with
the number of documents in the reference corpus which contain that
term, whether once or more often, but with the total accumulated
frequency of that term in the reference corpus.) <br>

Since WordSmith 4, however, it has been possible to know each
key-word's document frequency (the header column of the word-list
from which it is derived calls this "Texts"), so I could give users
the option a) to see this for each KW and b) to sort on it. <br>
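<br>
A minimal sketch of the difference between the two counts, in Python. The function name, the toy reference corpus and the whitespace tokenisation are assumptions made for illustration; this is not how WordSmith itself computes them.<br>
<pre>
```python
from collections import Counter


def corpus_and_document_frequency(texts):
    """Return (corpus_freq, doc_freq) for a list of tokenised texts.

    corpus_freq counts every occurrence of a term in the whole corpus.
    doc_freq counts the number of texts containing the term at least once
    (the "Texts" column).
    """
    corpus_freq = Counter()
    doc_freq = Counter()
    for tokens in texts:
        corpus_freq.update(tokens)    # every occurrence counts
        doc_freq.update(set(tokens))  # each text counts once per term
    return corpus_freq, doc_freq


# Hypothetical toy reference corpus of three tiny texts.
reference = [
    "the elephant saw another elephant".split(),
    "the zoo has an elephant".split(),
    "the cat sat".split(),
]

corpus_freq, doc_freq = corpus_and_document_frequency(reference)
print("corpus frequency of elephant:", corpus_freq["elephant"])
print("document frequency of elephant:", doc_freq["elephant"])
```
</pre>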

<br>

I doubt whether the current keyness multiplied by the Texts column

("consistency" as I otherwise call it, and Nation calls it "range")

would be useful though; I would think it better to consider keyness as

a feature of the term in that sub-corpus or single text, with the

chance to filter or re-sort according to consistency. For example, as
you know, I find IT and DO to be key in certain Shakespeare texts. They

are both extremely consistent terms. The keyness as a number is not a

very good indicator since terms which are rare in the language come out

more key than those which are more frequent. I regard it more as a
threshold: if a term gets over it, it's key. Then we can sort
secondarily, e.g. alphabetically, by consistency, by frequency in the
sub-corpus or text, etc.<br>
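<br>
The threshold-then-sort idea could be sketched as follows in Python. It assumes the two-term log-likelihood statistic mentioned in the quoted message, an illustrative cut-off of 6.63 (the chi-square critical value for p = 0.01 at one degree of freedom), and invented frequencies; none of the names or numbers come from WordSmith.<br>
<pre>
```python
import math

# Assumed cut-off: chi-square critical value for p = 0.01, 1 degree of freedom.
THRESHOLD = 6.63


def log_likelihood(a, b, c, d):
    """Two-term log-likelihood keyness.

    a, b: frequency of the term in the study text and the reference corpus.
    c, d: total running words in the study text and the reference corpus.
    """
    e1 = c * (a + b) / (c + d)  # expected frequency in the study text
    e2 = d * (a + b) / (c + d)  # expected frequency in the reference corpus
    ll = 0.0
    if a:  # a zero count contributes nothing to the sum
        ll += a * math.log(a / e1)
    if b:
        ll += b * math.log(b / e2)
    return 2.0 * ll


def key_words(candidates, study_size, ref_size, threshold=THRESHOLD):
    """Keep terms whose keyness clears the threshold, then sort by Texts.

    candidates maps term to (study_freq, ref_freq, texts).
    """
    kept = []
    for term, (a, b, texts) in candidates.items():
        # Only positively key terms: relatively more frequent in the study text.
        more_frequent_here = a / study_size > b / ref_size
        if more_frequent_here and log_likelihood(a, b, study_size, ref_size) >= threshold:
            kept.append((term, texts))
    # Secondary sort: consistency (Texts) descending, then alphabetically.
    return sorted(kept, key=lambda item: (-item[1], item[0]))


# Hypothetical figures: term -> (study frequency, reference frequency, Texts).
candidates = {
    "elephant": (50, 100, 3),
    "it": (200, 15000, 40),
    "the": (600, 60000, 40),
    "zebra": (1, 50, 1),
}

result = key_words(candidates, study_size=10000, ref_size=1000000)
print(result)
```
</pre>
With these made-up figures ELEPHANT scores far higher than IT, although both clear the threshold, which is why the score serves here only as a cut-off and the final ordering comes from the Texts column.<br>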

<br>

Cheers -- Mike<br>

<br>

Hunter, Duncan wrote:

<blockquote

 cite="midF0E9C2CC4BC43644A8502151088E487D58D6D2@ELDER.ads.warwick.ac.uk"

 type="cite">


  <div id="idOWAReplyText50502" dir="ltr">

  <div dir="ltr"><font color="#0000ff">

  <p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><font

 color="#000000" face="Times New Roman">Hello Colleagues! </font></p>

  <p class="MsoNormal" style="margin: 0cm 0cm 0pt;"> </p>

  <p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><font

 color="#000000" face="Times New Roman">A question about ‘key-ness’,

and key words, in a group of texts…</font></p>

  <p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span></span> </p>

  <p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><font

 color="#000000" face="Times New Roman">I’ve been mulling over some

‘key-ness’ statistics for a selection of texts I’ve been studying and a

rather odd question has occurred to me….</font></p>

  <p class="MsoNormal" style="margin: 0cm 0cm 0pt;"> </p>

  <p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><font

 color="#000000" face="Times New Roman">I’ve been attempting to

discover something of the thematic content or ‘about-ness’ of a group

of texts by using a keywords analysis, comparing the word frequency

profile of the selection of texts with a comparative group to derive

‘key-ness’ (via log-likelihood) stats for each word. </font></p>

  <p class="MsoNormal" style="margin: 0cm 0cm 0pt;"> </p>

  <p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><font

 face="Times New Roman"><font color="#000000">The key-ness value

returned by such a procedure can be misleading because of the problem

of dispersal; is the word <span style="color: black;">‘key’ because it

occurs in a lot of text samples in the corpus or because of a very high

usage in only a single text or small group of texts?</span></font></font></p>

  <p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span

 style="color: black;"></span> </p>

  <p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span

 style="color: black;"><font color="#000000" face="Times New Roman">It

occurs to me: would it be possible to formulate some kind of measure of

a word’s ‘overall key-ness’ in the set of texts we are studying? By

multiplying together the word’s key score by the number of texts in

which it is key, for example. Of course the resulting figure in this

case would be totally arbitrary in a sense; even in the non-parametric

realm of corpus comparison measurement it would not really ‘mean’

anything beyond its own description...</font></span></p>

  <p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span

 style="color: black;"></span> </p>

  <p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span

 style="color: black;"><font color="#000000" face="Times New Roman">However

it seems to me useful to have some kind of quantitative means of

describing a word’s significance across a range of texts in some way…</font></span><span

 style="color: black;"><font color="#000000" face="Times New Roman">Any

ideas?  <span style="color: black;">I am a relative 'newbie' in this

field; surely this issue has been tackled by somebody else somewhere?</span></font></span></p>

  <p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span

 style="color: black;"><font color="#000000" face="Times New Roman"><span

 style="color: black;"></span></font></span> </p>

  <p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span

 style="color: black;"><font color="#000000" face="Times New Roman"><span

 style="color: black;">All the best,</span></font></span></p>

  <p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span

 style="color: black;"><font color="#000000" face="Times New Roman"><span

 style="color: black;"></span></font></span> </p>

  <p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span

 style="color: black;"><font color="#000000" face="Times New Roman"><span

 style="color: black;">Duncan Hunter</span></font></span></p>

  </font></div>

  </div>

</blockquote>

<br>

<pre class="moz-signature" cols="72">-- 

Mike Scott

***

If you publish research which uses WordSmith, do let me know so I can include it at

<a class="moz-txt-link-freetext" href="http://www.lexically.net/wordsmith/corpus_linguistics_links/papers_using_wordsmith.htm">http://www.lexically.net/wordsmith/corpus_linguistics_links/papers_using_wordsmith.htm</a>

***

School of English

University of Liverpool

Liverpool L69 3BX, UK.

<a class="moz-txt-link-abbreviated" href="http://www.lexically.net">www.lexically.net</a>

<a class="moz-txt-link-abbreviated" href="http://www.liv.ac.uk/~ms2928">www.liv.ac.uk/~ms2928</a></pre>

</body>

</html>