[Corpora-List] unsupervised with semi-supervised

Eric Atwell eric at comp.leeds.ac.uk
Sun Apr 20 08:58:59 UTC 2008


Taras,

we've just discussed this issue at a Leeds language research seminar, so I
have some refs to hand...

I guess you've already tried Wikipedia:

'semi-supervised learning is a class of machine learning techniques that
make use of both labeled and unlabeled data for training - typically a
small amount of labeled data with a large amount of unlabeled data'
- by this definition, even 2 seed words is semi-supervised.

The PASCAL Morphochallenge (unsupervised morpheme analysis) website says
'unsupervised learning .... means that the program cannot be
explicitly given a training file containing "example answers", and nor
can example answers be hard-coded into the program.'
However there are other ways to include linguistic knowledge in the
system: '... one sees solutions where people make lots of "hard-coded"
assumptions about word structure, e.g., stem-final vowels that can be
dropped etc., so at some point one wonders where to draw a border
between entirely unsupervised methods, minimally supervised methods and
so on. Thus, it is important that all such assumptions be explicitly
mentioned when results are reported.'
http://www.cis.hut.fi/morphochallenge2008/faq.shtml

In other words, some researchers make a distinction between 'supervision'
(a training file containing example answers, eg words and their analyses),
and 'prior knowledge', linguistic knowledge built into the algorithm.
See the website for the PASCAL workshop on prior knowledge for text and
language processing http://prior-knowledge-language-ws.wikidot.com/
'... the domain of text and language processing thus appears as a very
promising field for studying the interactions between prior knowledge
and raw training data ...'
So 'a manually created word-list consisting of thousand items' is
prior knowledge rather than supervision.

Prior knowledge is actually orthogonal to the supervised/unsupervised
divide, as prior knowledge can also be used (or avoided) in supervised
machine learning systems, eg
Collobert R and Weston J. 2007. Fast Semantic Extraction Using a
Novel Neural Network Architecture. Proc ACL pp 560-567.
http://www.aclweb.org/anthology-new/P/P07/P07-1071.pdf
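
To make the two-axis view above concrete, here is a minimal sketch in
Python. It is only an illustration of the distinction, not a formal
definition: the class SystemDescription, its field names and the two
helper functions are all hypothetical, and the counts in the usage
example are made up. The supervision axis follows the Wikipedia-style
definition quoted earlier (labelled vs. unlabelled training data); the
prior-knowledge axis simply records whether any linguistic assumptions
are built into the algorithm.

```python
from dataclasses import dataclass


@dataclass
class SystemDescription:
    """Hypothetical summary of what information a learning system is given."""
    labelled_items: int            # training items that come with example answers
    unlabelled_items: int          # raw training items without answers
    built_in_assumptions: tuple = ()  # linguistic knowledge coded into the algorithm


def supervision_regime(system: SystemDescription) -> str:
    """Classify along the supervision axis only (labelled vs. unlabelled data)."""
    if system.labelled_items == 0:
        return "unsupervised"
    if system.unlabelled_items == 0:
        return "supervised"
    return "semi-supervised"


def uses_prior_knowledge(system: SystemDescription) -> bool:
    """Classify along the prior-knowledge axis, independently of supervision."""
    return len(system.built_in_assumptions) > 0


# Two seed words plus a large raw corpus: labelled data is tiny but not zero.
seeded = SystemDescription(labelled_items=2, unlabelled_items=100_000)

# No example answers at all, but a hand-made assumption inside the algorithm.
rule_based = SystemDescription(
    labelled_items=0,
    unlabelled_items=100_000,
    built_in_assumptions=("stem-final vowels can be dropped",),
)

print(supervision_regime(seeded), uses_prior_knowledge(seeded))
print(supervision_regime(rule_based), uses_prior_knowledge(rule_based))
```

Because the two functions never consult each other's inputs, any
combination is possible, which is what "orthogonal" amounts to here. It
also reflects the Morphochallenge advice that built-in assumptions should
be reported explicitly rather than folded into the supervision label.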

If you want more answers to your question, maybe you should attend the
PASCAL Morphochallenge and/or Prior Knowledge workshops!



Eric Atwell, Leeds University


On Sat, 19 Apr 2008, Taras Zagibalov wrote:

> Dear colleagues!
> I've been trying to find good definitions for supervised,
> semi-supervised and unsupervised machine learning in Computational
> Linguistics and NLP. I am especially interested in a good explanation of
> the difference between  unsupervised and semi-supervised learning: to my
> mind there must be some formally stated difference between a system that
> uses only two seed words and a system that uses a manually created
> word-list consisting of thousand items.
> I will be thankful for all ideas regarding the problem
>
> Best regards,
> Taras Zagibalov
> University of Sussex
>
> _______________________________________________
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
>

-- 
Eric Atwell,
  Senior Lecturer, Language research group leader, School of Computing,
  Faculty of Engineering, UNIVERSITY OF LEEDS, Leeds LS2 9JT, England
  TEL: 0113-3435430  FAX: 0113-3435468  WWW/email: google Eric Atwell
