Wake-Sleep Algorithm for Representational Learning
Hamid Reza Maei
Physiol. & Neurosci. Program, University of Toronto
Motivation
The brain is able to learn the underlying representation of its input data (e.g. images) in an unsupervised manner.
Challenges for neural networks:
1. They need a specific teacher that supplies the desired output.
2. They need a method for training all the connections.
The wake-sleep algorithm avoids these two problems.
[Figure: layered network above the data d, with visible/hidden layers (V/H), recognition weights R1, R2 and generative weights G1, G2]
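Logistic belief network
[Figure: layer Y (units y_j) above layer X (units x_i) above the data d, with generative weights G_ij^{xy} and biases g_j^y]
Advantage: the conditional distributions are factorial.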
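The factorial conditional can be illustrated with a minimal sketch of one top-down sampling step. It assumes the standard logistic form P(x_i = 1 | y) = sigmoid(g_i + sum_j G_ij y_j). The function name `sample_layer_below` and the numeric weights are hypothetical and chosen only for illustration.

```python
import math
import random

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def sample_layer_below(y, G, g, rng):
    """Sample binary units x given the states y of the layer above.

    Assumed form: P(x_i = 1 | y) = sigmoid(g[i] + sum_j G[i][j] * y[j]).
    Given y, each x_i is sampled independently (the factorial property).
    """
    probs = [
        sigmoid(g[i] + sum(G[i][j] * y[j] for j in range(len(y))))
        for i in range(len(g))
    ]
    x = [1 if rng.random() < p else 0 for p in probs]
    return x, probs

# Hypothetical toy parameters: 2 units above, 3 units below.
rng = random.Random(0)
y = [1, 0]
G = [[2.0, -1.0], [0.0, 3.0], [-4.0, 0.5]]
g = [0.0, -1.0, 1.0]
x, probs = sample_layer_below(y, G, g, rng)
```

Because each unit only needs the states of the layer above, generating data top-down is cheap. It is the reverse direction (inferring y from x) that is hard.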
Inference is intractable
Explaining away: Sprinkler and Rain become conditionally dependent once Wet is observed.
[Figure: Sprinkler and Rain both pointing to Wet]
Although it is a very crude approximation, let us approximate P(h|d; G) with a factorial distribution Q(h|d; R), where R denotes the recognition weights.
Learning generative weights
Any guarantee that learning improves? Yes! Using Jensen's inequality we find a lower bound on the log likelihood, namely the negative of the free energy.
Thus decreasing the free energy raises the lower bound and therefore tends to increase the log likelihood. This leads to the wake phase.
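For reference, the bound can be written out as follows. This is a standard derivation sketched in the notation of the slides; the symbol F(d; G, R) for the free energy is an assumption, since the slide's own equation did not survive extraction.

```latex
\begin{align*}
\log P(d; G) &= \log \sum_h P(h, d; G) \\
 &= \log \sum_h Q(h \mid d; R)\, \frac{P(h, d; G)}{Q(h \mid d; R)} \\
 &\ge \sum_h Q(h \mid d; R)\, \log \frac{P(h, d; G)}{Q(h \mid d; R)} \;=\; -F(d; G, R), \\
F(d; G, R) &= -\log P(d; G) + \mathrm{KL}\big(Q(h \mid d; R)\,\|\,P(h \mid d; G)\big).
\end{align*}
```

The last line shows why the bound is tight only when Q matches the true posterior: the gap is exactly the KL divergence from Q to P.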
Wake phase
Reminder: the true posterior P(h|d; G) is replaced by Q(h|d; R).
• Get samples (x° and y°) from the factorial distribution Q(h|d; R) (bottom-up pass).
• Use these samples in the generative model to change the generative weights.
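A minimal sketch of one wake-phase update follows, assuming a single hidden layer and a simple delta rule for the generative parameters. The function name `wake_step`, the parameter layout, and the learning rate are hypothetical and not taken from the slides.

```python
import math
import random

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def wake_step(d, R, r, G, g_d, g_h, lr, rng):
    """One wake-phase update (a sketch, assuming one hidden layer).

    R, r   : recognition weights and biases (data -> hidden), left unchanged
    G, g_d : generative weights and biases (hidden -> data), updated in place
    g_h    : generative biases of the hidden layer, updated in place
    """
    # Bottom-up pass: sample h from the factorial distribution Q(h | d; R).
    h = []
    for j in range(len(r)):
        q = sigmoid(r[j] + sum(R[j][i] * d[i] for i in range(len(d))))
        h.append(1 if rng.random() < q else 0)
    # Delta rule: move the top-down prediction of d towards the real d.
    for i in range(len(d)):
        p = sigmoid(g_d[i] + sum(G[i][j] * h[j] for j in range(len(h))))
        err = d[i] - p
        g_d[i] += lr * err
        for j in range(len(h)):
            G[i][j] += lr * err * h[j]
    # Make the sampled hidden state more probable under the generative prior.
    for j in range(len(h)):
        g_h[j] += lr * (h[j] - sigmoid(g_h[j]))
    return h

# Hypothetical toy setup: 3 visible units, 2 hidden units, zero initial weights.
rng = random.Random(0)
d = [1, 0, 1]
R = [[0.0] * 3 for _ in range(2)]
r = [0.0, 0.0]
G = [[0.0] * 2 for _ in range(3)]
g_d = [0.0, 0.0, 0.0]
g_h = [0.0, 0.0]
h = wake_step(d, R, r, G, g_d, g_h, lr=0.1, rng=rng)
```

Only the generative parameters change here; the recognition weights are used purely to pick the hidden states.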
Learning recognition weights
Ideally we would change the recognition weights to minimize the free energy above. However, the derivative of the free energy with respect to R gives complicated expressions that are computationally intractable.
What should be done? Switch the arguments of the KL divergence (KL is not a symmetric function!). This leads to the sleep phase.
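The asymmetry of the KL divergence is easy to check numerically. The sketch below uses two hypothetical three-outcome distributions; the numbers are chosen only for illustration.

```python
import math

def kl(p, q):
    """KL(p || q) in nats for two discrete distributions with full support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical distributions over three outcomes.
P = [0.8, 0.1, 0.1]
Q = [0.4, 0.3, 0.3]

kl_pq = kl(P, Q)  # KL(P || Q)
kl_qp = kl(Q, P)  # KL(Q || P)
print(f"KL(P||Q) = {kl_pq:.4f}, KL(Q||P) = {kl_qp:.4f}")
```

The two values differ, so swapping the arguments changes the objective being minimized.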
Sleep phase
1. Get samples (x●, y●) generated by the generative model, using data that come from nowhere (a top-down pass that does not start from real data).
2. Change the recognition connections using the delta rule.
Any guarantee of improvement (for the sleep phase)? No!
- In the sleep phase we minimize KL(P, Q), which is the wrong divergence.
- In the wake phase we minimize KL(Q, P), which is the right thing to do.
[Figure: wake-phase approximation versus sleep-phase approximation of P by Q]
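A minimal sketch of one sleep-phase update follows, assuming a single hidden layer and a simple delta rule for the recognition parameters. The function name `sleep_step`, the parameter layout, and the learning rate are hypothetical.

```python
import math
import random

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def sleep_step(R, r, G, g_d, g_h, lr, rng):
    """One sleep-phase update (a sketch, assuming one hidden layer).

    G, g_d, g_h : generative parameters, left unchanged
    R, r        : recognition weights and biases, updated in place
    """
    # Top-down pass: dream a hidden state from the prior, then dream data.
    h = [1 if rng.random() < sigmoid(b) else 0 for b in g_h]
    d = []
    for i in range(len(g_d)):
        p = sigmoid(g_d[i] + sum(G[i][j] * h[j] for j in range(len(h))))
        d.append(1 if rng.random() < p else 0)
    # Delta rule: make the recognition model recover h from the dreamed d.
    for j in range(len(h)):
        q = sigmoid(r[j] + sum(R[j][i] * d[i] for i in range(len(d))))
        err = h[j] - q
        r[j] += lr * err
        for i in range(len(d)):
            R[j][i] += lr * err * d[i]
    return h, d

# Hypothetical toy setup: 3 visible units, 2 hidden units, zero initial weights.
rng = random.Random(0)
R = [[0.0] * 3 for _ in range(2)]
r = [0.0, 0.0]
G = [[0.0] * 2 for _ in range(3)]
g_d = [0.0, 0.0, 0.0]
g_h = [0.0, 0.0]
h, d = sleep_step(R, r, G, g_d, g_h, lr=0.1, rng=rng)
```

Note that no real data enter this step: the recognition weights are fitted to the model's own dreams, which is the weakness the slides point out.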
The wake-sleep algorithm
1. Wake phase:
- Use the recognition weights to perform a bottom-up pass that creates samples for the layers above (from data).
- Train the generative weights using the samples obtained from the recognition model.
2. Sleep phase:
- Use the generative weights to reconstruct data by performing a top-down pass.
- Train the recognition weights using the samples obtained from the generative model.
[Figure: data d with recognition weights R1, R2 and generative weights G1, G2]
What is the wake-sleep algorithm really trying to achieve? It turns out that its goal is to learn representations that are economical to describe. We can describe this using Shannon's coding theory.
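One way to make "economical to describe" concrete is to read the free energy as an expected description length. The sketch below assumes a tiny one-hidden-layer model that is small enough to enumerate, and checks that the free energy under a factorial Q is never below -log P(d). All parameter values and the function names are hypothetical.

```python
import itertools
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def bernoulli_log_prob(states, probs):
    """Log probability of binary states under independent Bernoulli units."""
    return sum(math.log(p if s == 1 else 1.0 - p) for s, p in zip(states, probs))

def free_energy_and_neg_log_lik(d, q, G, g_d, g_h):
    """Exact free energy under factorial Q (with P(h_j = 1) = q[j]) and -log P(d)."""
    prior = [sigmoid(b) for b in g_h]
    free_energy = 0.0
    p_d = 0.0
    for h in itertools.product([0, 1], repeat=len(q)):
        log_p_h = bernoulli_log_prob(h, prior)
        cond = [
            sigmoid(g_d[i] + sum(G[i][j] * h[j] for j in range(len(h))))
            for i in range(len(d))
        ]
        log_p_d_given_h = bernoulli_log_prob(d, cond)
        log_q = bernoulli_log_prob(h, q)
        # Cost of describing h and then d given h, minus the bits recovered from Q.
        free_energy += math.exp(log_q) * (log_q - log_p_h - log_p_d_given_h)
        p_d += math.exp(log_p_h + log_p_d_given_h)
    return free_energy, -math.log(p_d)

# Hypothetical toy model: 3 visible units, 2 hidden units.
d = [1, 0, 1]
q = [0.7, 0.2]
G = [[1.0, -0.5], [-1.0, 0.5], [0.5, 1.5]]
g_d = [0.0, 0.0, -0.5]
g_h = [0.0, -1.0]
F, nll = free_energy_and_neg_log_lik(d, q, G, g_d, g_h)
F_bits = F / math.log(2)  # the same quantity expressed in bits
```

The gap between `F` and `nll` is the KL divergence from Q to the true posterior, so a better recognition model gives a shorter description.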
Simple example
Training:
1. For 4×4 images, we use a belief network with one visible layer and two hidden layers (binary neurons):
- The visible layer has 16 neurons.
- The first hidden layer (8 neurons) decides all possible orientations.
- The top hidden layer (1 neuron) decides between vertical and horizontal bars.
2. The network was trained on 2×10^6 random examples.
Hinton et al., Science (1995)
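A generator for this kind of bar image can be sketched as follows. The probabilities `p_vertical` and `p_bar` are assumptions made for illustration; the slides do not state the values used in the original experiment.

```python
import random

def sample_bars_image(rng, p_vertical=0.5, p_bar=0.5, size=4):
    """Sample one size x size binary image made only of vertical or only of horizontal bars.

    p_vertical and p_bar are hypothetical settings, not the original ones.
    Returns the image flattened row by row, plus the chosen orientation.
    """
    vertical = rng.random() < p_vertical          # top-level choice of orientation
    bars = [rng.random() < p_bar for _ in range(size)]  # which bars are switched on
    image = [[0] * size for _ in range(size)]
    for k, on in enumerate(bars):
        if not on:
            continue
        for t in range(size):
            if vertical:
                image[t][k] = 1   # fill column k
            else:
                image[k][t] = 1   # fill row k
    flat = [pixel for row in image for pixel in row]
    return flat, vertical

rng = random.Random(0)
training_set = [sample_bars_image(rng)[0] for _ in range(1000)]
```

The generative process mirrors the network's structure: one binary choice at the top, up to `2 * size` bar choices in the middle, and 16 pixels at the bottom.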
Wake-sleep algorithm on the 20 Newsgroups data set
• Contains about 20,000 articles.
• Many categories fall into overlapping clusters.
• We used a tiny version of this data set with binary occurrences of 100 words across 16,242 postings, which can be divided into 4 classes:
  - comp.*
  - sci.*
  - rec.*
  - talk.*
Training
• Visible layer: 100 visible units.
• First hidden layer: 50 hidden units.
• Second hidden layer: 20 hidden units (top layer).
• For training we used 60% of the data (9,745 training examples) and kept the rest for testing the model (6,497 test examples).
[Figure: visible and hidden layers]
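The split sizes can be checked with a short calculation. It assumes the training count is 60% of the 16,242 postings rounded down; the variable names are hypothetical.

```python
n_postings = 16242
n_train = n_postings * 60 // 100   # 60% of the data, rounded down
n_test = n_postings - n_train      # the rest is kept for testing
print(n_train, n_test)
```

Under that rounding assumption the counts agree with the 9,745 and 6,497 examples quoted on the slide.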
Just for fun!
Performance for model comp.* (class 1)
• 'windows', 'win', 'video', 'card', 'dos', 'memory', 'program', 'ftp', 'help', 'system'
• …
Performance for model talk.* (class 4)
• 'god', 'bible', 'jesus', 'question', 'christian', 'israel', 'religion', 'card', 'jews', 'email'
• 'world', 'jews', 'war', 'religion', 'god', 'jesus', 'christian', 'israel', 'children', 'food'
• …
Testing (classification)
• Train two separate wake-sleep models on two different classes, 1 and 4, that is, comp.* and talk.* respectively.
• Present the training examples from classes 1 and 4 to each of the two learned models and compute the free energy as a score under each model.
[Figure, panels labelled (Class 1) and (Class 4): examples from classes 1 and 4 presented to the learned wake-sleep model under model comp.*]
[Figure, panels labelled (Class 1) and (Class 4): examples from classes 1 and 4 presented to the learned wake-sleep model under model talk.*]
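The decision rule itself (score an example under each model and pick the lowest free energy) can be sketched independently of how the models were trained. In the sketch below the scoring functions are stand-ins: factorial Bernoulli models with no hidden units, for which the free energy reduces to -log P(d). The word probabilities and the names `classify_by_free_energy` and `make_factorial_scorer` are hypothetical.

```python
import math

def classify_by_free_energy(example, models):
    """Return the label whose model assigns the lowest free energy, plus all scores."""
    scores = {name: score(example) for name, score in models.items()}
    best = min(scores, key=scores.get)
    return best, scores

def make_factorial_scorer(word_probs):
    """Stand-in scorer (assumption): with no hidden units, free energy = -log P(d)."""
    def score(example):
        return -sum(
            math.log(p if w == 1 else 1.0 - p)
            for w, p in zip(example, word_probs)
        )
    return score

# Hypothetical word-occurrence probabilities for the words
# ['windows', 'dos', 'god', 'bible'] under each class model.
models = {
    "comp.*": make_factorial_scorer([0.8, 0.7, 0.1, 0.1]),
    "talk.*": make_factorial_scorer([0.1, 0.1, 0.8, 0.7]),
}
label, scores = classify_by_free_energy([1, 1, 0, 0], models)
```

With trained wake-sleep models, each scorer would instead estimate the free energy using samples from that model's recognition pass.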
Naïve Bayes classifier
• Assumptions:
  - P(c_j): frequency of the classes in the training examples (9,745).
  - Conditional independence assumption.
• Use Bayes' rule.
• Learn the model parameters using maximum likelihood (e.g. for classes 1 and 4).
Correct prediction on test examples: present test examples from classes 1 and 2 to the trained model and predict which class each belongs to. 80% correct predictions.
• Most probable words in each class:
  - comp.*: 'windows', 'help', 'email', 'problem', 'system', 'computer', 'software', 'program', 'university', 'drive'
  - talk.*: 'fact', 'god', 'government', 'question', 'world', 'christian', 'case', 'course', 'state', 'jews'
McCallum et al. (1998)
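The baseline can be sketched as a Bernoulli naïve Bayes classifier over binary word occurrences. Note one deviation from the slide: the sketch adds a smoothing constant `alpha` (an assumption) so that a word never seen in a class does not produce log(0); setting `alpha=0` gives the plain maximum-likelihood estimate. The toy data and function names are hypothetical.

```python
import math

def fit_bernoulli_nb(X, y, alpha=1.0):
    """Estimate class priors and per-word occurrence probabilities.

    alpha is an added smoothing constant (assumption); alpha=0 is pure maximum likelihood.
    """
    model = {}
    n_words = len(X[0])
    for c in sorted(set(y)):
        rows = [x for x, label in zip(X, y) if label == c]
        prior = len(rows) / len(X)                     # P(c_j): class frequency
        theta = [
            (sum(row[w] for row in rows) + alpha) / (len(rows) + 2 * alpha)
            for w in range(n_words)
        ]
        model[c] = (prior, theta)
    return model

def predict(model, x):
    """Bayes' rule with the conditional independence assumption, in log space."""
    best, best_score = None, -math.inf
    for c, (prior, theta) in model.items():
        score = math.log(prior) + sum(
            math.log(t if xi == 1 else 1.0 - t) for xi, t in zip(x, theta)
        )
        if score > best_score:
            best, best_score = c, score
    return best

# Hypothetical toy data: columns are occurrences of ['windows', 'help', 'god', 'jews'].
X = [[1, 1, 0, 0], [1, 0, 0, 0], [0, 0, 1, 1], [0, 1, 1, 1]]
y = ["comp.*", "comp.*", "talk.*", "talk.*"]
model = fit_bernoulli_nb(X, y)
pred_a = predict(model, [1, 1, 0, 0])
pred_b = predict(model, [0, 0, 1, 1])
```

The most probable words per class, as listed on the slide, correspond to the largest entries of `theta` for that class.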
Conclusion
• Wake-sleep is an unsupervised learning algorithm.
• The higher hidden layers store representations.
• Although we have used very crude approximations, it works very well on some realistic data.
• Wake-sleep tries to make the representation economical to describe (Shannon's coding theory).
Flaws of the wake-sleep algorithm
• The sleep phase makes horrible assumptions (although it worked!):
  - It minimizes KL(P||Q) rather than KL(Q||P).
  - The recognition weights are trained not from data space but from dream space!
* Variational approximations.
Using complementary priors to eliminate explaining away
1. Because of explaining away there ..
Remove the correlations in the hidden layers: complementary priors, etc.
Do complementary priors exist? A very hard question, and not obvious!
But it is possible to remove the effect of explaining away using this architecture:
[Figure: stack of layers V0 (units v_j^0), H0 (units h_i^0), V1 (units v_j^1), H1 (units h_i^1), connected by tied weights G and G^T]
Inference is very easy because the distributions are factorial.
Restricted Boltzmann Machine: Hinton et al., Neural Computation (2006); Hinton et al., Science (2006)
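The claim that inference is factorial in a restricted Boltzmann machine can be checked by brute force on a tiny example. The sketch below assumes the usual RBM energy with hidden biases `b` and weights `W`, and compares the exact posterior over hidden states with the product of per-unit sigmoids. The numeric values and function names are hypothetical.

```python
import itertools
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def hidden_input(v, W, b, i):
    """Total input to hidden unit i given the visible vector v."""
    return b[i] + sum(W[i][j] * v[j] for j in range(len(v)))

def exact_posterior(v, W, b):
    """P(h | v) by enumeration; visible-bias terms cancel in the normalization."""
    weights = {}
    for h in itertools.product([0, 1], repeat=len(b)):
        neg_energy = sum(h[i] * hidden_input(v, W, b, i) for i in range(len(b)))
        weights[h] = math.exp(neg_energy)
    z = sum(weights.values())
    return {h: w / z for h, w in weights.items()}

def factorial_posterior(v, W, b):
    """Product of independent per-unit probabilities sigmoid(input_i)."""
    q = [sigmoid(hidden_input(v, W, b, i)) for i in range(len(b))]
    return {
        h: math.prod(q[i] if h[i] == 1 else 1.0 - q[i] for i in range(len(b)))
        for h in itertools.product([0, 1], repeat=len(b))
    }

# Hypothetical toy RBM: 3 visible units, 2 hidden units.
v = [1, 0, 1]
W = [[0.5, -1.0, 2.0], [-0.3, 0.8, 0.1]]
b = [0.2, -0.4]
exact = exact_posterior(v, W, b)
factored = factorial_posterior(v, W, b)
max_gap = max(abs(exact[h] - factored[h]) for h in exact)
```

The two distributions agree up to floating-point error, because the energy has no hidden-to-hidden terms and so contains nothing that could couple the hidden units once v is fixed.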