[Corpora-List] Keyness across Texts

John F. Sowa sowa at bestweb.net
Tue Jul 10 17:03:13 UTC 2007


Przemek,

That is a "key" idea about "keyness":

 > One fantastic feature of KeyWords is of course the possibility
 > of extracting key clusters. One can, for example, group and
 > count those clusters in which specific key words repeat, and
 > this way additionally confirm and contextualize their status,
 > very nicely indeed.

The fundamental issue for any version of "keyness" is its
definition and the algorithms that implement that definition.

The simple definition in terms of frequency counts is the most
widely used because it can be implemented in simple algorithms.
But even then, questions arise about lemmata:  are the algorithms
counting words or lemmata?  And how do the algorithms deal with
lemmata that are lexicalized in different parts of speech?
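
To make the word/lemma question concrete, here is a minimal
sketch of a frequency-based keyness score.  It assumes the
log-likelihood statistic, which is one common choice and not
necessarily what any particular tool computes.  The names
keyness, log_likelihood, and LEMMAS are hypothetical, and the
toy lemma table stands in for a real lemmatizer.

```python
import math
from collections import Counter

def log_likelihood(a, b, n1, n2):
    """Log-likelihood for a word seen a times in a study corpus of
    n1 tokens and b times in a reference corpus of n2 tokens."""
    e1 = n1 * (a + b) / (n1 + n2)   # expected count in study corpus
    e2 = n2 * (a + b) / (n1 + n2)   # expected count in reference corpus
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2 * ll

def keyness(study_tokens, reference_tokens, normalize=str.lower):
    """Rank items that are relatively more frequent in the study corpus.
    `normalize` decides what is counted: word forms or lemmata."""
    study = Counter(normalize(t) for t in study_tokens)
    ref = Counter(normalize(t) for t in reference_tokens)
    n1, n2 = sum(study.values()), sum(ref.values())
    scores = {}
    for item, a in study.items():
        b = ref.get(item, 0)
        if a / n1 > b / n2:          # keep positive keyness only
            scores[item] = log_likelihood(a, b, n1, n2)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Toy lemma table (an assumption for illustration only).
LEMMAS = {"clusters": "cluster", "clustered": "cluster"}

def lemma(token):
    t = token.lower()
    return LEMMAS.get(t, t)

study = ["cluster", "clusters", "clustered", "the", "of", "the"]
reference = ["the", "of", "the", "a", "word", "the", "of", "cluster"]

by_word = keyness(study, reference)           # counts word forms
by_lemma = keyness(study, reference, lemma)   # counts lemmata
```

Counting word forms scatters the evidence over "cluster",
"clusters", and "clustered"; counting lemmata pools it into a
single item.  The same texts therefore yield two different
key-item lists, depending only on the counting unit.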

Clusters can provide a more precise way of defining keyness,
but the number of variations of clustering algorithms is
immense, and each one defines a different version of keyness.

The next step is to apply syntactic and/or semantic techniques
to determine how the words/lemmata are related.  Then the
syntactic and/or semantic structures could be used as input
to the counting and/or clustering methods.

And of course, you could also apply an ontology to relate the
words and/or lemmata.  In fact, you might even use these techniques
to extract a document-specific ontology from the texts, and then
use that ontology for analyzing other documents.

In short, there is no sharp boundary between keyness and
other issues of semantics.  There is nothing wrong with
using simple, special-purpose techniques for addressing a
particular problem, but it is important to recognize their
limitations and their relationships to broader semantic issues.

John Sowa


