[Corpora-List] Keyness across Texts

Mon Jul 9 17:22:58 UTC 2007

Dear Duncan,

I am not an expert on this topic, but I have a little experience with it.

As far as I understand, key-ness of a word is a relative concept, and
every statistical method for keywords just finds words which are
(statistically) 'special' in a certain context.

Of course, to find something special, we need to know what is general.
For example, with WordSmith, we have to compare a 'study corpus' to a
'reference corpus' to get the keywords of the study corpus.
(The reference corpus is supposed to represent what is 'general'.)

Since key-ness is defined relative to the general-ness of words, which
is represented by the reference corpus, the choice of the reference
corpus will greatly affect which keywords are extracted.
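
A minimal sketch in Python may make this dependence concrete. It uses
the log-likelihood statistic, one commonly used keyness measure; it is
not WordSmith's own implementation, and the function names
(log_likelihood, keywords) are made up for illustration.

```python
import math
from collections import Counter

def log_likelihood(a, b, c, d):
    """Log-likelihood keyness for one word.

    a: frequency of the word in the study corpus
    b: frequency of the word in the reference corpus
    c: total tokens in the study corpus
    d: total tokens in the reference corpus
    """
    # Expected frequencies if the word were equally likely in both corpora.
    e1 = c * (a + b) / (c + d)
    e2 = d * (a + b) / (c + d)
    g2 = 0.0
    if a > 0:
        g2 += a * math.log(a / e1)
    if b > 0:
        g2 += b * math.log(b / e2)
    return 2 * g2

def keywords(study_tokens, reference_tokens, top_n=10):
    """Rank words that are relatively more frequent in the study corpus."""
    study = Counter(study_tokens)
    ref = Counter(reference_tokens)
    c, d = len(study_tokens), len(reference_tokens)
    scored = []
    for word, a in study.items():
        b = ref[word]  # Counter returns 0 for unseen words
        if a / c > b / d:  # keep only words over-represented in the study corpus
            scored.append((word, log_likelihood(a, b, c, d)))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]
```

Every score depends on b and d, i.e. on the reference corpus, so the
same study corpus yields a different keyword list for each reference
corpus it is compared with.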

For example, if we want to extract keywords to classify articles from
the Wall Street Journal, it would make sense to use the Wall Street
Journal corpus as the reference corpus.

Once, I compared a corpus of Medline abstracts to the Wall Street
Journal corpus to see the difference between the two domains. The
WordSmith tool extracted many verbs like 'be' or 'observed' as
keywords, which would hardly be accepted as keywords.
I think such verbs represent expressions which are 'specially'
frequent in scientific language when compared to journalistic
language.
(I think Scott's Shakespeare example demonstrates a similar case.)

This observation indicates that when we try to extract keywords which
explain the 'content' of a document (or a set of documents), we have
to prepare a reference corpus which matches the study corpus in every
aspect other than 'content', e.g. in writing style.

With 'content' also, by choosing the reference corpus properly, we
can 'control' which keywords are extracted. For example, when we
extract keywords for a group of newspaper articles from the Economy
section, the word 'economy' would not be extracted if the reference
corpus consists only of articles from the Economy section, but would
be extracted if the reference corpus consists of articles from all
sections.
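
The 'economy' example can be sketched with a few lines of Python. All
counts below are invented purely for illustration, and log-likelihood
is assumed as the keyness statistic (the helper name log_likelihood is
hypothetical, not a WordSmith function).

```python
import math

def log_likelihood(a, b, c, d):
    """Keyness of a word seen a times in c study tokens
    and b times in d reference tokens."""
    e1 = c * (a + b) / (c + d)  # expected count in the study corpus
    e2 = d * (a + b) / (c + d)  # expected count in the reference corpus
    g2 = 0.0
    if a > 0:
        g2 += a * math.log(a / e1)
    if b > 0:
        g2 += b * math.log(b / e2)
    return 2 * g2

# Invented counts: 'economy' occurs 30 times in a 1,000-token study
# corpus of Economy-section articles.
study_freq, study_size = 30, 1000

# Reference A: Economy section only, where 'economy' is just as frequent.
score_same_section = log_likelihood(study_freq, 300, study_size, 10000)

# Reference B: all sections, where 'economy' is much rarer.
score_all_sections = log_likelihood(study_freq, 60, study_size, 10000)

# 3.84 is the chi-square critical value for p < 0.05 with 1 degree of freedom.
THRESHOLD = 3.84
print("Economy-only reference:", score_same_section > THRESHOLD)
print("All-sections reference:", score_all_sections > THRESHOLD)
```

With these made-up numbers, 'economy' falls below the threshold
against the Economy-only reference and well above it against the
all-sections reference, although the study corpus is unchanged.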

Hope you find it helpful.

Best Wishes,

Jin-Dong

---
Jin-Dong Kim, Ph.D.,
Project Lecturer,
Department of Computer Science,
Graduate School of Information Science and Technology,
University of Tokyo