[Corpora-List] unsupervised with semi-supervised

Thu Apr 24 11:30:49 UTC 2008

Taras,

Here is my opinion, for what it's worth...

Firstly, as some other people noted, I think it's somewhat confusing to 
think about the positioning of rule-based systems in the context of 
supervised/semi-supervised/unsupervised methods, since the degree of 
supervision is fundamentally a way of describing the behavior of a 
method which induces classification rules from labelled examples. I 
agree that the issue is somewhat complicated if the rules are used to 
label examples, and these are then used to train a classifier. However, I 
believe the issue then becomes the integrity of the training examples. If all 
rule-sets produce perfect labelling of examples, I don't see that the 
absolute number of rules is an issue in the degree of supervision. If 
not, it strikes me that the purity of the labelled sets is the key 
issue, not the number of rules which were used to create them.
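
To make the point about purity concrete, here is a rough sketch. The toy 
texts, the two rule sets and the helper names (label_with_rules, purity) 
are all invented for illustration. It only shows that purity is measured 
against the true labels, and does not depend on how many rules produced 
the labelling.

```python
from typing import Callable, Optional, Sequence

# A rule maps a text to a label, or to None if it does not fire.
Rule = Callable[[str], Optional[str]]


def label_with_rules(text: str, rules: Sequence[Rule]) -> Optional[str]:
    """Return the label from the first rule that fires, else None."""
    for rule in rules:
        label = rule(text)
        if label is not None:
            return label
    return None


def purity(texts: Sequence[str], gold: Sequence[str],
           rules: Sequence[Rule]) -> float:
    """Fraction of rule-assigned labels that agree with the true labels."""
    pairs = [(label_with_rules(t, rules), g) for t, g in zip(texts, gold)]
    labelled = [(p, g) for p, g in pairs if p is not None]
    if not labelled:
        return 0.0
    return sum(p == g for p, g in labelled) / len(labelled)


# Toy data, invented for this sketch.
texts = ["great film", "awful film", "not great", "really great"]
gold = ["POS", "NEG", "NEG", "POS"]

# Three simple rules, tried in order.
many_rules = [
    lambda t: "POS" if "great" in t else None,
    lambda t: "NEG" if "awful" in t else None,
    lambda t: "POS" if "film" in t else None,
]

# A single rule that happens to handle negation.
one_rule = [
    lambda t: "NEG" if t.startswith(("not ", "awful")) else "POS",
]

print(purity(texts, gold, many_rules))  # 0.75 ("not great" is mislabelled)
print(purity(texts, gold, one_rule))    # 1.0
```

Here the single rule gives the purer labelled set, so it is the better 
source of training examples even though fewer rules were written.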

Your other two questions again relate to the *number* of examples used, 
and I don't believe the definitions of the supervision paradigms have 
anything to do with the number of examples (rather the number of 
examples, amongst other things, typically dictates which paradigm to 
use). Whilst there is considerable wiggle-room within the definitions, 
and they are, I feel, a matter of consensus rather than rigorous tests, 
my understanding of the division is as follows:

unsupervised: no labelled examples
supervised: only labelled examples
semi-supervised: a mixture of labelled and unlabelled examples

Notice that the semi-supervised problem is trivially related to the 
other two paradigms: remove all labelled examples, and the problem 
becomes unsupervised; remove all unlabelled examples, and it becomes 
supervised.
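
As a minimal sketch of this relationship (the function name paradigm, the 
use of None for a missing label, and the toy examples are my own 
illustrative assumptions, not any standard API), the paradigm can be read 
off purely from which examples carry labels:

```python
from typing import Optional, Sequence, Tuple

# Hypothetical representation: each example is (text, label),
# where a label of None means the example is unlabelled.
Example = Tuple[str, Optional[str]]


def paradigm(examples: Sequence[Example]) -> str:
    """Name the supervision paradigm implied by which labels are present."""
    if not examples:
        raise ValueError("need at least one example")
    n_labelled = sum(1 for _, label in examples if label is not None)
    if n_labelled == 0:
        return "unsupervised"
    if n_labelled == len(examples):
        return "supervised"
    return "semi-supervised"


data = [
    ("bank of the river", "GEO"),
    ("bank account", "FIN"),
    ("bank holiday", None),
    ("river bank erosion", None),
]

# Removing one kind of example turns the problem into one of the other two.
only_unlabelled = [ex for ex in data if ex[1] is None]
only_labelled = [ex for ex in data if ex[1] is not None]

print(paradigm(data))             # semi-supervised
print(paradigm(only_unlabelled))  # unsupervised
print(paradigm(only_labelled))    # supervised
```

Note that the function never looks at how many examples there are, only 
at whether labels are present.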

As I said before, I believe that the number of examples is only an issue 
in deciding which paradigm is most appropriate for the task at hand. For 
example, if labelled examples are hard to come by, but unlabelled ones 
are plentiful, semi-supervised seems appropriate. If no labelled 
examples are available, or the desire is to cluster/describe the data, 
then unsupervised is most appropriate, and so on.

Of course, within the context of semi-supervised learning, there are yet 
more divisions depending upon whether one is interested in inducing a 
classification rule which covers all possible examples (inductive 
classification), or only those examples which are currently unlabelled 
(transductive classification), 
but I suspect this is a discussion for another time...

Hope this is of some help.

Ben
