[Corpora-List] Decision tree : maximise recall over precision

Tue Apr 21 15:54:41 UTC 2009

I too would be interested to hear what you end up doing, since I'm also
using decision trees for information retrieval.

I was thinking about the following hack: duplicate the "yes" training
examples N times. This should keep the "yes" nodes in the tree from
being pruned for containing too few examples relative to the "no"
nodes. I was actually writing code to do that right now, because I
have a problem where I have trustworthy positive examples and less
trustworthy negative examples.
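
A minimal sketch of that duplication hack, assuming the training data is
held as parallel lists of feature rows and labels. The function name
oversample_positives and the toy data are hypothetical, not taken from
Weka or any other library. The copies should go into the training split
only, since duplicated examples in an evaluation set would distort the
scores.

```python
def oversample_positives(rows, labels, n, positive="yes"):
    """Return new lists in which every positive example appears n times.

    Negative examples are kept once each. The input lists are not modified.
    """
    out_rows, out_labels = [], []
    for row, label in zip(rows, labels):
        copies = n if label == positive else 1
        out_rows.extend([row] * copies)
        out_labels.extend([label] * copies)
    return out_rows, out_labels


# Toy data (made up for illustration): one "yes" among three "no".
rows = [[0.2, 1.0], [0.9, 0.1], [0.4, 0.7], [0.8, 0.3]]
labels = ["no", "yes", "no", "no"]

new_rows, new_labels = oversample_positives(rows, labels, n=3)
print(new_labels.count("yes"), new_labels.count("no"))
```

Choosing N so that the two classes end up roughly balanced is one
obvious starting point; larger N pushes the tree further towards recall.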

Another approach might be to try clustering, to turn it into a
multi-class problem.  If it turns out there are clusters that only
contain negative examples that you can identify with high precision,
then you can throw out examples classified into those clusters.
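
As a rough sketch of the filtering step, assuming some clustering
algorithm has already assigned a cluster id to every training example:
the helper names pure_negative_clusters and keep_candidates, the
min_size parameter, and the toy data are all hypothetical. The
clustering itself is not shown.

```python
from collections import defaultdict


def pure_negative_clusters(cluster_ids, labels, positive="yes", min_size=1):
    """Return the ids of clusters that contain no positive training example.

    min_size guards against trusting very small clusters, which may be
    negative-only by chance.
    """
    counts = defaultdict(lambda: [0, 0])  # cluster id -> [n_positive, n_total]
    for cid, label in zip(cluster_ids, labels):
        counts[cid][1] += 1
        if label == positive:
            counts[cid][0] += 1
    return {
        cid for cid, (n_pos, n_total) in counts.items()
        if n_pos == 0 and n_total >= min_size
    }


def keep_candidates(items, cluster_ids, discard):
    """Drop every item that was assigned to a cluster in discard."""
    return [item for item, cid in zip(items, cluster_ids) if cid not in discard]


# Toy data (made up): clusters 0 and 2 hold only "no" examples.
cluster_ids = [0, 0, 1, 1, 2, 2]
labels = ["no", "no", "no", "yes", "no", "no"]
discard = pure_negative_clusters(cluster_ids, labels)

items = ["a", "b", "c", "d", "e", "f"]
kept = keep_candidates(items, cluster_ids, discard)
print(sorted(discard), kept)
```

Whether the negative-only clusters stay negative-only on unseen data is
the thing to check on a held-out set before throwing anything away.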

Note that it sounds like you actually do not want to maximize recall -
you could trivially do that by simply returning all results in the
corpus.  It might be more helpful to think about maximizing weighted
F-score, where the weight is biased towards recall.
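
For illustration, the usual weighted F-score is F_beta, where beta > 1
weights recall more heavily than precision. The sketch below uses the
standard formula; the function name f_beta and the "selective" numbers
are made up for the example, while the 0.001 precision corresponds to
returning everything from a corpus with 0.1% positives.

```python
def f_beta(precision, recall, beta=2.0):
    """Weighted F-score; beta > 1 favours recall, beta < 1 favours precision."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Returning the whole corpus: perfect recall, precision equal to the
# proportion of positives (0.1%).
return_everything = f_beta(precision=0.001, recall=1.0)

# A hypothetical filter that loses a few positives but is more precise.
selective = f_beta(precision=0.005, recall=0.95)

print(return_everything < selective)
```

Even with beta = 2, the "return everything" strategy scores worse than
the hypothetical filter, which is why a recall-weighted F-score is a
more useful target than recall alone.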

Stefanie

Eric Atwell wrote:
> Emmanuel,
> 
> Surely a good decision procedure is "JUST SAY NO!" - "only" 99.9% accurate! 
> I wish PoS-taggers and other text annotation tools were as good!
> 
> It sounds like you want to find out how to set a WEKA decision-tree
> builder to NOT prune any branches ... this question is better put to 
> the WEKA mailing list wekalist at list.scms.waikato.ac.nz - see
> https://list.scms.waikato.ac.nz/mailman/listinfo/wekalist to join
> 
> Eric Atwell, Leeds University
> 
> PS - please let me know if you find the answer - this looks like an
> interesting class coursework exercise!
> 
> 
> On Tue, 21 Apr 2009, Emmanuel Prochasson wrote:
> 
>> Dear all,
>>
>> I would like to build a decision tree (or any other relevant supervised
>> classifier) on a data set containing 0.1% "Yes" and 99.9% "No", using
>> several attributes (12 for now, but I have to tune that). I use Weka,
>> which is totally awesome.
>>
>> My goal is to prune the search space for another application (i.e.,
>> remove, say, 80% of the data that are very unlikely to be "Yes"); that's
>> why I'm trying to use a decision tree. Of course, some algorithms return
>> a one-leaf tree tagged "No", with 99.9% accuracy, which looks good but
>> ensures I would always discard all of my search space rather than
>> prune it.
>>
>> My problem is: is there a way (algorithm? software?) to build a tree
>> that will maximise recall (all "Yes" elements tagged "Yes" by the
>> algorithm)? I don't really care about precision (it's OK if many "No"
>> elements are tagged "Yes"; I can handle false positives).
>>
>> In other words, is there a way to build a decision tree under the
>> constraint of 100% recall?
>>
>> I'm not sure I made myself clear, and I'm not sure there are solutions
>> for my problem.
>>
>> Regards,
>>
>>
> 

_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora