[Corpora-List] Finding representative terms

Mon Dec 26 20:13:36 UTC 2005

Hi, Delip,

If I understand correctly, you are using IDF without
weighting the terms by their term frequencies (TF)!

This will surely give poor results, since terms
that occur in the same number of domains/documents cannot
be discriminated from each other. With a large number of terms,
many of them tie, so a correct ranking is not possible.

Including the multiplicative factor TF (the term frequency in a domain,
Nij), on the other hand, promotes a term in the domains/documents where
it is frequent and demotes it in the domains where it is used less often.
Judging from my early experience, in which I used an IDF-like measure
alone, using both TF and IDF should improve the performance significantly.
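
To make the tie problem concrete, here is a minimal sketch in Python. The
toy corpus, the function names, and the plain log(N/DF) form of IDF are
all assumptions made for illustration; they are not taken from Delip's
setup or from my experiments.

```python
import math
from collections import Counter

# Hypothetical toy corpus: one token list per domain/document.
docs = {
    "finance": "bank loan bank interest bank river".split(),
    "nature": "river mountain river bank forest".split(),
    "sports": "match goal team goal loan".split(),
}

def idf_scores(docs):
    """IDF only: log(N / DF), where DF counts the domains containing the term."""
    n = len(docs)
    df = Counter(term for tokens in docs.values() for term in set(tokens))
    return {term: math.log(n / d) for term, d in df.items()}

def tf_idf_scores(docs):
    """TF * IDF per domain, with TF taken as the raw count in that domain."""
    idf = idf_scores(docs)
    return {
        name: {term: tf * idf[term] for term, tf in Counter(tokens).items()}
        for name, tokens in docs.items()
    }

idf = idf_scores(docs)
tfidf = tf_idf_scores(docs)
# IDF alone cannot separate "bank" from "loan": both occur in two domains.
print(idf["bank"], idf["loan"])
# With TF included, the two terms get different scores in "finance".
print(tfidf["finance"]["bank"], tfidf["finance"]["loan"])
```

In this toy data "bank" and "loan" receive identical IDF scores, while
the TF factor ranks "bank" above "loan" within the "finance" domain.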

This TF factor also partially resolves your single-domain/document problem.
Frequent terms are kings if you have only one document (or, in general,
if the DF's are the same).

Also, I have tried a refined version of DF (more precisely, a revision of
log(DF)), called cross-domain entropy (CDE) or inter-domain entropy (IDE)
(NEITHER relative entropy NOR cross entropy!!), which is then used
to estimate the expected number of domains/documents as E[DF] = 2**CDE.
(The term 'expectation' is used loosely here, not in the strict sense.)

The CDE measure considers the probability of a term
in a domain/document to decide whether one should increment the DF
by one (or only by a fraction) when the term appears in one
domain/document.

Roughly speaking, if it is a frequent term in a domain/document,
DF tends to be incremented by one; otherwise, only a fractional
count is added to DF.
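
The idea can be sketched as follows. This is only an illustration under
stated assumptions: CDE is taken to be the base-2 Shannon entropy of a
term's counts normalized across domains, and the weight is taken to be
TF * (log2(N) - CDE), i.e. CDE standing in for log(DF). The exact
normalization in the paper may differ, and the counts and function names
are hypothetical.

```python
import math

# Hypothetical data: counts[term][domain] = occurrences of term in that domain.
counts = {
    "bank": {"finance": 30, "nature": 2, "sports": 0},
    "the": {"finance": 50, "nature": 50, "sports": 50},
    "goal": {"finance": 0, "nature": 0, "sports": 12},
}

def cross_domain_entropy(domain_counts):
    """Entropy (in bits) of the term's distribution over domains (assumed form)."""
    total = sum(domain_counts.values())
    probs = [c / total for c in domain_counts.values() if c > 0]
    return -sum(p * math.log2(p) for p in probs)

def expected_df(domain_counts):
    """E[DF] = 2**CDE: a fractional, frequency-aware replacement for DF."""
    return 2 ** cross_domain_entropy(domain_counts)

def cde_weight(tf, domain_counts, n_domains):
    """Assumed weighting: TF * (log2(N) - CDE), with CDE in place of log2(DF)."""
    return tf * (math.log2(n_domains) - cross_domain_entropy(domain_counts))

for term, domain_counts in counts.items():
    plain_df = sum(1 for c in domain_counts.values() if c > 0)
    print(term, plain_df, expected_df(domain_counts))
```

Under these assumptions, a term spread evenly over all three domains
("the") gets E[DF] close to 3, a term confined to one domain ("goal")
gets E[DF] = 1, and "bank", whose plain DF is 2, gets an E[DF] between
1 and 2 because its few occurrences in "nature" only add a fractional
count.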

Such a refinement (TF * Inverse E[DF]) consistently gave
some improvement over the TF-IDF term weighting method
in my experiments (on domain-specific word extraction and
document classification). I would like to see whether the refinement
consistently outperforms TF-IDF in other tasks too,
so you are welcome to refer to this work:

Jing-Shin Chang, "Domain Specific Word Extraction from
    Hierarchical Web Documents: A First Step Toward Building
    Lexicon Trees from Web Corpora," Proceedings of the Fourth
    SIGHAN Workshop on Chinese Language Processing, IJCNLP-05
    (International Joint Conference on Natural Language Processing),
    pp. 64-71, Jeju Island, Korea, October 14-15, 2005.

http://nlp.csie.ncnu.edu.tw/~shin/doc/SIGHAN.2005/DSW.SIGHAN.2005.Camera.Ready.Jing_Shin_Chang+B.pdf

As a final comment, when referring to "representative terms",
it might be more precise to say "domain-specific representative terms"
or "representative terms in a specific domain", since a term may be
ambiguous in sense and may not be representative in all domains.

For instance, "bank" may be specific/representative in the "finance" domain
(for its high term frequency in that domain), but it may not be as
representative as other terms (like "mountain", "river") when describing
natural scenes (for its relatively lower term frequency).

- Jing-Shin Chang -^^-

> From owner-corpora at lists.uib.no Tue Dec 27 00:55:01 2005
> Date: Tue, 27 Dec 2005 00:20:07 +0800 (CST)
> From: Delip Rao <deliprao at yahoo.com>
> Subject: [Corpora-List] Finding representative terms
> 
> Hi,
> 
> Is there any work that tries to find the most
> important/representative words from a document? I have
> tried using IDF but results were very poor. Also IDF
> does not make sense if we have a single document and
> want to get the most important term(s) out of it.
> 
> Thanks!
> Delip