
Deep Learning of Representations for Unsupervised and Transfer Learning


Presentation Transcript


  1. Deep Learning of Representations for Unsupervised and Transfer Learning • Yoshua Bengio, Statistical Machine Learning Chair, U. Montreal • ICML 2011 Workshop on Unsupervised and Transfer Learning, July 2nd 2011, Bellevue, WA

  2. How to Beat the Curse of Many Factors of Variation? Compositionality: exponential gain in representational power • Distributed representations • Deep architecture

  3. Distributed Representations • Many neurons active simultaneously • Input represented by the activation of a set of features that are not mutually exclusive • Can be exponentially more efficient than local representations

  4. Local vs Distributed

  5. RBM Hidden Units Carve Input Space [Figure: hidden units h1, h2, h3 carving the two-dimensional input space (x1, x2) into regions]

  6. Unsupervised Deep Feature Learning • Classical: pre-process data with PCA = leading factors • New: learning multiple levels of features/factors, often over-complete • Greedy layer-wise strategy: each level is trained, unsupervised, on the representation produced by the level below, starting from the raw input x, followed by supervised fine-tuning of P(y|x) (Hinton et al. 2006, Bengio et al. 2007, Ranzato et al. 2007)
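As an illustration of the greedy layer-wise recipe only (not the RBM or auto-encoder learners used in the cited papers), here is a minimal sketch that stacks PCA levels with a squashing non-linearity in between; the function name, the tanh choice, and the layer sizes are illustrative assumptions:

```python
import numpy as np
from sklearn.decomposition import PCA

def greedy_layerwise_pretrain(X, layer_sizes):
    """Fit each level, unsupervised, on the codes produced by the level below."""
    layers, H = [], X
    for n_components in layer_sizes:
        layer = PCA(n_components=n_components).fit(H)
        layers.append(layer)
        H = np.tanh(layer.transform(H))   # squashing non-linearity between levels
    return layers, H                      # H would feed supervised fine-tuning of P(y|x)

X = np.random.randn(500, 100)             # placeholder for "raw input x"
layers, top_features = greedy_layerwise_pretrain(X, [64, 32, 16])
```

In the deep-learning setting each PCA level would be replaced by a non-linear unsupervised learner such as an RBM or a denoising auto-encoder, sketched on the later slides.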

  7. Why Deep Learning? • Hypothesis 1: we need a deep hierarchy of features to efficiently represent and learn the complex abstractions needed for AI and mammal intelligence. • Computational & statistical efficiency • Hypothesis 2: unsupervised learning of representations is a crucial component of the solution. • Optimization & regularization. • Theoretical and ML-experimental support for both. • Cortex: deep architecture, the same learning rule everywhere

  8. Deep Motivations • Brains have a deep architecture • Humans' ideas & artifacts are composed from simpler ones • Insufficient depth can be exponentially inefficient • Distributed (possibly sparse) representations are necessary for non-local generalization, exponentially more efficient than 1-of-N enumeration of latent variable values • Multiple levels of latent variables allow combinatorial sharing of statistical strength [Figure: tasks 1, 2, 3 built on shared intermediate representations over the raw input x]

  9. Deep Architectures are More Expressive • Theoretical arguments: 2 layers of logic gates, formal neurons, or RBF units = universal approximator • Theorems for all 3 (Hastad et al 86 & 91, Bengio et al 2007): functions compactly represented with k layers may require exponential size with k-1 layers [Figure: a wide shallow circuit vs. a deep circuit with few units per layer]

  10. Sharing Components in a Deep Architecture • A polynomial expressed with shared components (a sum-product network): the advantage of depth may grow exponentially (Bengio & Delalleau, Learning Workshop 2011)

  11. Parts Are Composed to Form Objects Layer 1: edges Layer 2: parts Layer 3: objects Lee et al. ICML’2009

  12. Deep Architectures and Sharing Statistical Strength, Multi-Task Learning • Generalizing better to new tasks is crucial to approach AI • Deep architectures learn good intermediate representations that can be shared across tasks • Good representations make sense for many tasks [Figure: task outputs y1, y2, y3 computed from a shared intermediate representation h over the raw input x]

  13. Feature and Sub-Feature Sharing • Different tasks can share the same high-level features • Different high-level features can be built from the same set of lower-level features • More levels = up to exponential gain in representational efficiency [Figure: task outputs y1 … yN built either on shared low-level and high-level features, or on separate features per task with no sharing of intermediate features]

  14. Representations as Coordinate Systems • PCA: removing low-variance directions is easy, but what if the signal has low variance? We would like to disentangle the factors of variation, keeping them all. • Overcomplete representations: richer, even if the underlying distribution concentrates near a low-dimensional manifold. • Sparse/saturated features: allow for variable-dimension manifolds. The few sensitive features at a point x form a local chart coordinate system.

  15. Effect of Unsupervised Pre-training AISTATS’2009+JMLR 2010, with Erhan, Courville, Manzagol, Vincent, S. Bengio

  16. Effect of Depth [Figure: results with pre-training vs. without pre-training as depth increases]

  17. Unsupervised Feature Learning: a Regularizer to Find Better Local Minima of Generalization Error • Unsupervised pre-training acts like a regularizer • Helps initialize the network in the basin of attraction of local minima with better generalization error

  18. Non-Linear Unsupervised Feature Extraction Algorithms • CD for RBMs • SML (PCD) for RBMs • Sampling beyond Gibbs (e.g. tempered MCMC) • Mean-field + SML for DBMs • Sparse auto-encoders • Sparse Predictive Decomposition • Denoising Auto-Encoders • Score Matching / Ratio Matching • Noise-Contrastive Estimation • Pseudo-Likelihood • Contractive Auto-Encoders See my book / review paper (F&TML 2009): Learning Deep Architectures for AI

  19. Auto-Encoders • Reconstruction = decoder(encoder(input)) • Probable inputs have small reconstruction error • Linear decoder/encoder = PCA up to rotation • Can be stacked to form highly non-linear representations, increasing disentangling (Goodfellow et al, NIPS 2009) [Figure: the encoder maps the input to a code of latent features; the decoder maps the code back to a reconstruction]
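A minimal numpy sketch of an auto-encoder with tied weights and a squared-error reconstruction criterion; the class and parameter names are illustrative, and a real implementation would use mini-batches and a library such as Theano:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

class TiedAutoEncoder:
    """Encoder h = s(Wx + b), decoder r = s(W'h + c), trained on squared error."""
    def __init__(self, n_visible, n_hidden, seed=0):
        rng = np.random.RandomState(seed)
        self.W = 0.01 * rng.randn(n_visible, n_hidden)
        self.b = np.zeros(n_hidden)
        self.c = np.zeros(n_visible)

    def encode(self, X):
        return sigmoid(X @ self.W + self.b)

    def decode(self, H):
        return sigmoid(H @ self.W.T + self.c)

    def fit(self, X, lr=0.1, n_epochs=100):
        n = len(X)
        for _ in range(n_epochs):
            H = self.encode(X)
            R = self.decode(H)
            dR = (R - X) * R * (1 - R)            # backprop through decoder sigmoid
            dH = (dR @ self.W) * H * (1 - H)      # backprop through encoder sigmoid
            self.W -= lr * (X.T @ dH + dR.T @ H) / n   # both paths touch the tied W
            self.b -= lr * dH.mean(axis=0)
            self.c -= lr * dR.mean(axis=0)
        return self

    def reconstruction_error(self, X):
        """Large values flag inputs the model considers improbable."""
        return ((self.decode(self.encode(X)) - X) ** 2).sum(axis=1)
```

Stacking means training one such layer, encoding the data, and training the next layer on the resulting codes, as in the greedy layer-wise sketch above.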

  20. Restricted Boltzmann Machine (RBM) • The most popular building block for deep architectures • Bipartite undirected graphical model over hidden and observed units • Inference is trivial: P(h|x) & P(x|h) factorize
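For a binary RBM the factorized conditionals are independent sigmoids, which is what makes inference trivial. A small sketch with the standard energy parameterization E(x, h) = -b'x - c'h - h'Wx; the layer sizes are illustrative:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.RandomState(0)
n_visible, n_hidden = 784, 500
W = 0.01 * rng.randn(n_hidden, n_visible)
b = np.zeros(n_visible)                    # visible biases
c = np.zeros(n_hidden)                     # hidden biases

def p_h_given_x(x):
    """P(h_j = 1 | x) for every hidden unit: independent sigmoids."""
    return sigmoid(c + W @ x)

def p_x_given_h(h):
    """P(x_i = 1 | h) for every visible unit: independent sigmoids."""
    return sigmoid(b + W.T @ h)
```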

  21. Gibbs Sampling in RBMs • P(h|x) and P(x|h) factorize: easy inference, convenient Gibbs sampling • Alternate x and h along the chain: h1 ~ P(h|x1), x2 ~ P(x|h1), h2 ~ P(h|x2), x3 ~ P(x|h2), h3 ~ P(h|x3), …
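A sketch of the alternating x → h → x chain, using the factorized conditionals from the previous slide; each step is a vector of independent Bernoulli draws, and all sizes and names are illustrative:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gibbs_chain(x0, W, b, c, n_steps, rng):
    """Alternate h_t ~ P(h|x_t) and x_{t+1} ~ P(x|h_t) for n_steps steps."""
    x = x0
    for _ in range(n_steps):
        h = (rng.uniform(size=c.shape) < sigmoid(c + W @ x)).astype(float)
        x = (rng.uniform(size=b.shape) < sigmoid(b + W.T @ h)).astype(float)
    return x, h

rng = np.random.RandomState(0)
n_v, n_h = 784, 500
W = 0.01 * rng.randn(n_h, n_v)
x0 = rng.binomial(1, 0.5, size=n_v).astype(float)
x_k, h_k = gibbs_chain(x0, W, np.zeros(n_v), np.zeros(n_h), n_steps=10, rng=rng)
```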

  22. RBMs are Universal Approximators (Le Roux & Bengio 2008, Neural Comp.) • Adding one hidden unit (with proper choice of parameters) guarantees increasing the likelihood • With enough hidden units, can perfectly model any discrete distribution • RBMs with a variable number of hidden units = non-parametric

  23. Denoising Auto-Encoder • Learns a vector field pointing towards higher-probability regions • Minimizes a variational lower bound on a generative model • Similar to pseudo-likelihood • A form of regularized score matching [Figure: a corrupted input is mapped back to a reconstruction of the clean input]
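A minimal sketch of the corruption step; masking noise is one common choice, and the corruption level and sizes are illustrative. The comment shows where the clean target enters the criterion:

```python
import numpy as np

def corrupt(X, corruption_level, rng):
    """Masking noise: set a random fraction of the input dimensions to zero."""
    mask = rng.uniform(size=X.shape) > corruption_level
    return X * mask

rng = np.random.RandomState(0)
X = rng.binomial(1, 0.5, size=(100, 784)).astype(float)   # clean inputs
X_tilde = corrupt(X, corruption_level=0.3, rng=rng)        # corrupted inputs
# One training step then minimizes reconstruction_error(decode(encode(X_tilde)), X):
# the target is the *clean* X, so the model learns to point back toward it.
```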

  24. Stacked Denoising Auto-Encoders • No partition function, so the training criterion can be measured directly • Encoder & decoder: any parametrization • Performs as well as or better than stacking RBMs for unsupervised pre-training [Figure: results on Infinite MNIST]

  25. Contractive Auto-Encoders • Contractive Auto-Encoders: Explicit Invariance During Feature Extraction, Rifai, Vincent, Muller, Glorot & Bengio, ICML 2011. • Higher Order Contractive Auto-Encoders, Rifai, Mesnil, Vincent, Muller, Bengio, Dauphin, Glorot, ECML 2011. • Part of the winning toolbox in the final phase of the Unsupervised & Transfer Learning Challenge 2011

  26. Contractive Auto-Encoders • Few active units represent the active subspace (local chart) • The Jacobian's spectrum is peaked = local low-dimensional representation / relevant factors • Training criterion: the contraction penalty wants contraction in all directions, but the reconstruction term cannot afford contraction in the manifold directions
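The contraction penalty is the squared Frobenius norm of the encoder's Jacobian, which has a closed form for a sigmoid encoder. A minimal numpy sketch; the shapes and the weighting constant `lam` are illustrative assumptions:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def contractive_penalty(X, W, b):
    """Average squared Frobenius norm of the encoder Jacobian over a batch.

    For h = sigmoid(W x + b), dh_j/dx_i = h_j (1 - h_j) W_ji, so the penalty
    factorizes into (h_j (1 - h_j))^2 times the squared row norms of W."""
    H = sigmoid(X @ W.T + b)                 # (n_examples, n_hidden)
    row_norms = (W ** 2).sum(axis=1)         # ||W_j.||^2 for each hidden unit j
    return ((H * (1 - H)) ** 2 @ row_norms).mean()

# Training criterion (sketch): reconstruction_error + lam * contractive_penalty,
# where lam trades off contraction against the need to reconstruct well.
rng = np.random.RandomState(0)
X = rng.randn(32, 50)
W = 0.01 * rng.randn(100, 50)
print(contractive_penalty(X, W, np.zeros(100)))
```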

  27. Unsupervised and Transfer Learning Challenge: 1st Place in Final Phase [Figure: challenge results for the raw data and for 1, 2, 3 and 4 layers of learned features]

  28. Transductive Representation Learning • The validation and test sets have different classes than the training set, hence very different input distributions • Directions that matter to distinguish them might have small variance under the training set • Solution: perform the last level of unsupervised feature learning (PCA) on the validation / test set input data
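A minimal sketch of the transductive step, under the assumption (as on the slide) that the last level is PCA; the function and variable names are illustrative:

```python
import numpy as np
from sklearn.decomposition import PCA

def transductive_top_level(train_codes, eval_codes, n_components=10):
    """Fit the last (linear) level on the unlabeled evaluation codes themselves,
    so that directions that matter for the new classes are not discarded."""
    pca = PCA(n_components=n_components).fit(eval_codes)
    return pca.transform(train_codes), pca.transform(eval_codes)

rng = np.random.RandomState(0)
train_codes = rng.randn(1000, 200)   # codes from the pre-trained deep feature extractor
eval_codes = rng.randn(400, 200)     # validation / test codes (different classes)
train_top, eval_top = transductive_top_level(train_codes, eval_codes)
```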

  29. Domain Adaptation (ICML 2011) • Small (4-domain) Amazon benchmark: we beat the state-of-the-art handsomely • The sparse rectifier SDA finds more features that tend to be useful for predicting either domain or sentiment, but not both

  30. Sentiment Analysis: Transfer Learning • 25 Amazon.com domains: toys, software, video, books, music, beauty, … • Unsupervised pre-training of input space on all domains • Supervised SVM on 1 domain, generalize out-of-domain • Baseline: bag-of-words + SVM
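A hedged sketch of the bag-of-words + SVM baseline only (in the deep pipeline the raw bag-of-words is replaced by features pre-trained, unsupervised, on all domains); the example reviews and labels are made up:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

in_domain_reviews = ["great toy, my kids love it",
                     "broke after two days, waste of money"]
in_domain_labels = [1, 0]                          # 1 = positive, 0 = negative
out_of_domain_reviews = ["the plot of this book is dull and predictable"]

vectorizer = CountVectorizer(binary=True)          # bag-of-words features
X_in = vectorizer.fit_transform(in_domain_reviews)
clf = LinearSVC().fit(X_in, in_domain_labels)      # supervised SVM on one domain

X_out = vectorizer.transform(out_of_domain_reviews)  # out-of-domain test
print(clf.predict(X_out))
```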

  31. Sentiment Analysis: Large-Scale Out-of-Domain Generalization [Figure: relative loss from going in-domain to out-of-domain testing; 340k examples from Amazon, from 56 (tools) to 124k (music)]

  32. Representing Sparse High-Dimensional Stuff • Deep Sparse Rectifier Neural Networks, Glorot, Bordes & Bengio, AISTATS 2011. • Sampled Reconstruction for Large-Scale Learning of Embeddings, Dauphin, Glorot & Bengio, ICML 2011. [Figure: the sparse input is cheap to encode into latent features, but reconstructing dense output probabilities is expensive]
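One way to make the dense output side cheap, in the spirit of the sampled-reconstruction paper cited above (this is a rough, hedged reading of the idea, with illustrative names and sizes, not the paper's exact procedure):

```python
import numpy as np

def sampled_reconstruction_targets(nonzero_idx, n_outputs, n_negatives, rng):
    """Reconstruct only the non-zero input dimensions plus a random subset of
    the zeros, instead of all n_outputs units; reweighting the sampled zeros
    keeps the expected gradient close to that of full reconstruction."""
    sampled_zeros = rng.choice(n_outputs, size=n_negatives, replace=False)
    return np.union1d(nonzero_idx, sampled_zeros)

rng = np.random.RandomState(0)
nonzero_idx = np.array([3, 17, 40212])            # active words in a sparse input
targets = sampled_reconstruction_targets(nonzero_idx, n_outputs=100000,
                                         n_negatives=128, rng=rng)
```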

  33. Speedup from Sampled Reconstruction

  34. Deep Self-Taught Learning for Handwritten Character Recognition (Y. Bengio & 16 others; IFT6266 class project & AISTATS 2011 paper) • Discriminate 62 character classes (upper, lower, digits), 800k to 80M examples • Deep learners beat the state-of-the-art on NIST and reach human-level performance • Deep learners benefit more from perturbed (out-of-distribution) data • Deep learners benefit more from the multi-task setting

  35. Improvement due to training on perturbed data [Figure: deep vs. shallow models] • {SDA,MLP}1: trained only on data with distortions (thickness, slant, affine, elastic, pinch) • {SDA,MLP}2: trained with all perturbations

  36. Improvement due to the multi-task setting [Figure: deep vs. shallow models] • Multi-task: train on all categories, test on target categories (shared representations) • Not multi-task: train and test on target categories only (no sharing, separate models)

  37. Comparing against Humans and SOA • All 62 character classes • Deep learners reach human performance • [1] Granger et al., 2007 [2] Pérez-Cortes et al., 2000 (nearest neighbor) [3] Oliveira et al., 2002b (MLP) [4] Milgram et al., 2005 (SVM)

  38. Tips & Tricks • Don't be scared by the many hyper-parameters: use random sampling (not grid search) & clusters / GPUs • Learning rate is the most important, along with the top-level dimension • Make sure the selected hyper-parameter value is not on the border of its interval • Early stopping • Use NLDR (non-linear dimensionality reduction) for visualization • Simulate the final evaluation scenario
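A minimal sketch of random hyper-parameter sampling; the particular hyper-parameters and ranges are illustrative assumptions:

```python
import numpy as np

rng = np.random.RandomState(0)
trials = []
for _ in range(50):                                # one config per cluster job / GPU
    config = {
        "learning_rate": 10 ** rng.uniform(-4, -1),           # sampled on a log scale
        "top_layer_size": int(rng.choice([500, 1000, 2000, 4000])),
        "n_layers": int(rng.randint(1, 5)),
        "corruption_level": float(rng.uniform(0.0, 0.5)),
    }
    trials.append(config)
# After evaluating all trials, check that the best value of each hyper-parameter
# is not sitting on the border of its sampled interval; if it is, widen the range.
```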

  39. Conclusions • Deep Learning: powerful arguments & generalization principles • Unsupervised Feature Learning is crucial: many new algorithms and applications in recent years • DL particularly suited for multi-task learning, transfer learning, domain adaptation, self-taught learning, and semi-supervised learning with few labels

  40. http://deeplearning.net • http://deeplearning.net/software/theano: numpy-like expressions compiled for CPU & GPU

  41. Thank you! Questions? LISA ML Lab team
