This thesis explores methods for building prediction systems when training examples are limited or costly to obtain. By leveraging extra information from both input and output spaces, we aim to improve prediction performance while minimizing the costs of gathering high-quality training data. The work discusses different regularization techniques for input encoding and innovative coding methods for output representation, enabling effective learning from unlabeled text and enhancing multi-task learning through robust statistical techniques.
Learning with Limited Supervision by Input and Output Coding Yi Zhang Machine Learning Department Carnegie Mellon University April 30th, 2012
Thesis Committee • Jeff Schneider, Chair • Geoff Gordon • Tom Mitchell • Xiaojin (Jerry) Zhu, University of Wisconsin-Madison
Introduction • Learning a prediction system, usually based on training examples (x1, y1), …, (xn, yn) • Training examples are usually limited • Cost of obtaining high-quality examples • Complexity of the prediction problem [Figure: a learner maps the input space X to the output space Y]
Introduction • Solution: exploit extra information about the input and output spaces • Improve prediction performance • Reduce the cost of collecting training examples [Figure: a learner maps the input space X to the output space Y, given examples (x1, y1), …, (xn, yn)]
Introduction • Solution: exploit extra information about the input and output spaces • How can such information be represented and discovered? • How can it be incorporated into learning? [Figure: extra information, marked with question marks, attached to both the input space X and the output space Y]
Outline • Part I: Encoding Input Information by Regularization (Learning with word correlation; A matrix-normal penalty for multi-task learning; Learn compressible models; Projection penalties) • Part II: Encoding Output Information by Output Codes (Composite likelihood for pairwise coding; Multi-label output codes with CCA; Maximum-margin output coding)
Regularization • The general formulation • Ridge regression • Lasso
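For reference, the three items on this slide in standard notation (n training examples (x_i, y_i), coefficient vector w, regularization weight λ):

General formulation: \min_w \sum_{i=1}^{n} L(w^{\top} x_i, y_i) + \lambda\, J(w)

Ridge regression: \min_w \sum_{i=1}^{n} (y_i - w^{\top} x_i)^2 + \lambda \lVert w \rVert_2^2

Lasso: \min_w \sum_{i=1}^{n} (y_i - w^{\top} x_i)^2 + \lambda \lVert w \rVert_1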
Outline • Part I: Encoding Input Information by Regularization (Learning with word correlation; A matrix-normal penalty for multi-task learning; Learn compressible models; Projection penalties) • Part II: Encoding Output Information by Output Codes (Composite likelihood for pairwise coding; Multi-label output codes with CCA; Maximum-margin output coding)
Learning with unlabeled text • For a text classification task • Plenty of unlabeled text is available on the Web • It is seemingly unrelated to the task • What can we gain from such unlabeled text? Yi Zhang, Jeff Schneider and Artur Dubrawski. Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text. NIPS 2008
A motivating example for text learning • Humans learn text classification effectively! • Two training examples: • +: [gasoline, truck] • -: [vote, election] • Query: • [gallon, vehicle] • Seems very easy! But why?
A motivating example for text learning • Humans learn text classification effectively! • Two training examples: • +: [gasoline, truck] • -: [vote, election] • Query: • [gallon, vehicle] • Seems very easy! But why? • Gasoline ~ gallon, truck ~ vehicle
A covariance operator for regularization • Covariance structure of model coefficients • Usually unknown -- learn from unlabeled text?
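A sketch of the penalty this slide refers to, under the usual Gaussian-prior reading (Σ denotes the covariance structure of the model coefficients; how it is estimated follows on the next slides):

\min_w \sum_{i=1}^{n} L(w^{\top} x_i, y_i) + \lambda\, w^{\top} \Sigma^{-1} w

With Σ = I this reduces to the ridge penalty; an informative Σ encourages correlated words (e.g., gasoline and gallon) to receive similar coefficients.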
Learning with unlabeled text • Infer the covariance operator • Extract latent topics from unlabeled text (with resampling) • Observe the contribution of words in each topic [gas: 0.3, gallon: 0.2, truck: 0.2, safety: 0.2, …] • Estimate the correlation (covariance) of words
Learning with unlabeled text • Infer the covariance operator • Extract latent topics from unlabeled text (with resampling) • Observe the contribution of words in each topic [gas: 0.3, gallon: 0.2, truck: 0.2, safety: 0.2, …] • Estimate the correlation (covariance) of words • For a new task, we learn with regularization
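A minimal sketch of this pipeline, assuming scikit-learn's LDA as the topic model; the function names, the single-fit shortcut (the thesis uses resampling), and the shrinkage constant are illustrative choices, not the exact procedure:

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from scipy.optimize import minimize

def word_covariance_from_topics(doc_term_counts, n_topics=50, shrinkage=0.1):
    """Estimate a word-word covariance from topics fit on unlabeled text.
    Each topic's word distribution is treated as one observation per word."""
    lda = LatentDirichletAllocation(n_components=n_topics).fit(doc_term_counts)
    topics = lda.components_ / lda.components_.sum(axis=1, keepdims=True)  # (n_topics, n_words)
    cov = np.cov(topics, rowvar=False)            # word-word covariance across topics
    return cov + shrinkage * np.eye(cov.shape[0]) # shrink toward identity for invertibility

def fit_logreg_with_covariance(X, y, cov, lam=1.0):
    """Logistic regression with the penalty lam * w^T cov^{-1} w (labels y in {-1, +1})."""
    prec = np.linalg.inv(cov)
    def objective(w):
        margins = y * (X @ w)
        return np.logaddexp(0.0, -margins).sum() + lam * w @ prec @ w
    w0 = np.zeros(X.shape[1])
    return minimize(objective, w0, method="L-BFGS-B").x
```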
Experiments • Empirical results on 20 newsgroups • 190 1-vs-1 classification tasks, 2% labeled examples • For any single task, the majority of the unlabeled text (18 of the 20 newsgroups) is irrelevant to it • Similar results with logistic regression and least squares [1] V. Sindhwani and S. Keerthi. Large scale semi-supervised linear SVMs. In SIGIR, 2006
Outline • Part I: Encoding Input Information by Regularization (Learning with word correlation; A matrix-normal penalty for multi-task learning [multi-task generalization]; Learn compressible models; Projection penalties) • Part II: Encoding Output Information by Output Codes (Composite likelihood for pairwise coding; Multi-label output codes with CCA; Maximum-margin output coding)
Multi-task learning • Different but related prediction tasks • An example • Landmine detection using radar images • Multiple tasks: different landmine fields • Geographic conditions • Landmine types • Goal: information sharing among tasks
Regularization for multi-task learning • Our approach: view MTL as estimating a parameter matrix W, with one task's coefficient vector per row
Regularization for multi-task learning • Our approach: view MTL as estimating a parameter matrix W • A covariance operator for regularizing a matrix? • Vector w: a Gaussian prior with a covariance operator, as before • Matrix W: what is the analogous prior? Yi Zhang and Jeff Schneider. Learning Multiple Tasks with a Sparse Matrix-Normal Penalty. NIPS 2010
Matrix-normal distributions • Consider a 2 by 3 matrix W • The full covariance of vec(W) is the Kronecker product of a row covariance and a column covariance
Matrix-normal distributions • Consider a 2 by 3 matrix W • The full covariance of vec(W) is the Kronecker product of a row covariance and a column covariance • The matrix-normal density offers a compact form for this full covariance
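In standard form, for an m × p matrix W with mean M, row covariance Ω_r (m × m) and column covariance Ω_c (p × p):

\mathrm{vec}(W) \sim \mathcal{N}\!\left(\mathrm{vec}(M),\; \Omega_c \otimes \Omega_r\right)

p(W \mid M, \Omega_r, \Omega_c) = \frac{\exp\!\left(-\tfrac{1}{2}\,\mathrm{tr}\!\left[\Omega_c^{-1} (W - M)^{\top} \Omega_r^{-1} (W - M)\right]\right)}{(2\pi)^{mp/2}\, \lvert\Omega_r\rvert^{p/2}\, \lvert\Omega_c\rvert^{m/2}}

so the mp × mp full covariance never has to be formed explicitly.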
Learning with a matrix-normal penalty • Joint learning of multiple tasks under a matrix-normal prior on W • Alternating optimization
Learning with a matrix-normal penalty • Joint learning of multiple tasks under a matrix-normal prior on W • Alternating optimization • Other recent work can be viewed as variants of special cases: • Multi-task feature learning [Argyriou et al, NIPS 06]: learning with the feature covariance • Clustered multi-task learning [Jacob et al, NIPS 08]: learning with the task covariance and spectral constraints • Multi-task relationship learning [Zhang et al, UAI 10]: learning with the task covariance
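A sketch of the resulting objective and the alternating scheme, with a generic loss L, penalty weight λ, and w_t the t-th row of W (exact scalings and the log-determinant terms of the prior are omitted):

\min_{W,\,\Omega_r,\,\Omega_c}\; \sum_{t=1}^{m} \sum_{i=1}^{n_t} L\!\left(w_t^{\top} x_{ti},\, y_{ti}\right) + \lambda\, \mathrm{tr}\!\left[\Omega_r^{-1}\, W\, \Omega_c^{-1}\, W^{\top}\right] + (\text{log-determinant terms from the matrix-normal prior})

With (Ω_r, Ω_c) fixed, the W-step is an ordinary regularized learning problem; with W fixed, the covariances have matrix-normal maximum-likelihood style updates of the form Ω_r ∝ W Ω_c^{-1} W^{\top} and Ω_c ∝ W^{\top} Ω_r^{-1} W.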
Sparse covariance selection • Sparse covariance selection in matrix-normal penalties • Sparsity of the inverse row and column covariances • This encodes conditional independence among rows (tasks) and columns (feature dimensions) of W
Sparse covariance selection • Sparse covariance selection in matrix-normal penalties • Sparsity of the inverse row and column covariances • This encodes conditional independence among rows (tasks) and columns (feature dimensions) of W • Alternating optimization • Estimating W: same as before • Estimating the row and column covariances: L1-penalized covariance estimation
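A minimal sketch of the L1-penalized covariance step for the task (row) side, assuming scikit-learn's GraphicalLasso as the sparse estimator; treating the feature dimensions of W as the observations is the illustrative simplification here:

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

def sparse_row_covariance(W, alpha=0.05):
    """L1-penalized (sparse inverse) covariance over the rows of W (tasks).
    Columns of W are treated as observations of an m-dimensional task variable;
    `alpha` controls the sparsity of the estimated precision matrix."""
    samples = W.T  # shape (p feature dims, m tasks): one observation per feature dim
    gl = GraphicalLasso(alpha=alpha).fit(samples)
    # Zeros in precision_ correspond to conditional independence between tasks.
    return gl.covariance_, gl.precision_
```

The column (feature) covariance can be estimated analogously from W itself, swapping the roles of rows and columns.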
Results on multi-task learning • Landmine detection: multiple landmine fields • Face recognition: multiple 1-vs-1 tasks [1] Jacob, Bach, and Vert. Clustered multi-task learning: A convex formulation. NIPS, 2008 [2] Argyriou, Evgeniou, and Pontil. Multi-task feature learning. NIPS, 2006
Outline • Part I: Encoding Input Information by Regularization (Learning with word correlation; A matrix-normal penalty for multi-task learning [multi-task generalization]; Learn compressible models [go beyond covariance and correlation structures]; Projection penalties) • Part II: Encoding Output Information by Output Codes (Composite likelihood for pairwise coding; Multi-label output codes with CCA; Maximum-margin output coding)
Learning compressible models • A compression operator P in place of the covariance operator • Bias: model compressibility Yi Zhang, Jeff Schneider and Artur Dubrawski. Learning Compressible Models. SDM 2010
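Written out, with P a fixed compression operator (e.g., a 2D-DCT applied to w arranged as an image):

\min_w \sum_{i=1}^{n} L(w^{\top} x_i, y_i) + \lambda \lVert P w \rVert_1

Compared to the lasso, which penalizes \lVert w \rVert_1 directly, this biases toward models whose coefficients are sparse after compression, i.e., compressible rather than sparse.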
Energy compaction • Image energy is concentrated at a few frequencies • [Figure: JPEG (2D-DCT), 46:1 compression]
Energy compaction • Image energy is concentrated at a few frequencies • Models need to operate at relevant frequencies • [Figure: JPEG (2D-DCT), 46:1 compression; coefficients shown in the 2D-DCT domain]
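A small sketch of the energy-compaction effect using SciPy's 2D DCT; the keep fraction is an illustrative choice and the snippet is not tied to the slide's JPEG example:

```python
import numpy as np
from scipy.fft import dctn, idctn

def dct_compress(image, keep_fraction=0.02):
    """Keep only the largest-magnitude 2D-DCT coefficients and reconstruct.
    Illustrates energy compaction: a small fraction of frequencies carries
    most of the image energy."""
    coeffs = dctn(image, norm="ortho")
    k = max(1, int(keep_fraction * coeffs.size))
    threshold = np.sort(np.abs(coeffs), axis=None)[-k]
    compressed = np.where(np.abs(coeffs) >= threshold, coeffs, 0.0)
    kept_energy = (compressed ** 2).sum() / (coeffs ** 2).sum()  # by Parseval, energy retained
    return idctn(compressed, norm="ortho"), kept_energy
```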
Digit recognition: sparse vs. compressible • Model coefficients w • [Figure: for the sparse and the compressible model, the compressed coefficients Pw, the coefficients w, and w displayed as an image]
Outline • Part I: Encoding Input Information by Regularization (Learning with word correlation; A matrix-normal penalty for multi-task learning [multi-task generalization]; Learn compressible models [go beyond covariance and correlation structures]; Projection penalties [encode a dimension reduction]) • Part II: Encoding Output Information by Output Codes (Composite likelihood for pairwise coding; Multi-label output codes with CCA; Maximum-margin output coding)
Dimension reduction • Dimension reduction conveys information about the input space • Feature selection: the importance of individual features • Feature clustering: the granularity of features • Feature extraction: more general structures
How to use a dimension reduction? • However, any reduction loses certain information • May be relevant to a prediction task • Goal of projection penalties: • Encode useful information from a dimension reduction • Control the risk of potential information loss Yi Zhang and Jeff Schneider. Projection Penalty: Dimension Reduction without Loss. ICML 2010
Projection penalties: the basic idea • Observation: reducing the feature space is equivalent to restricting the model search to a model subspace MP • Solution: still search in the full model space M, but penalize the projection distance to the model subspace MP
Projection penalties: linear cases • Learn with a (linear) dimension reduction P
Projection penalties: linear cases • Learn with projection penalties: penalize the projection distance to the model subspace induced by P • Optimization: empirical loss plus the projection distance (sketched below)
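A sketch of the linear formulation consistent with the description above, where P ∈ R^{p×d} is the linear reduction, v the reduced-space model, and w the full-space model:

\min_{w \in \mathbb{R}^{d},\; v \in \mathbb{R}^{p}} \;\sum_{i=1}^{n} L\!\left(w^{\top} x_i,\, y_i\right) + \lambda\, \lVert w - P^{\top} v \rVert_2^{2}

Forcing w = P^{\top} v recovers plain learning in the reduced space; the penalty instead keeps the full model space and only charges for the distance to the subspace M_P = \{P^{\top} v\}.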
Projection penalties: nonlinear cases • [Figure: the reduction P maps Rd to Rp, or a kernel-induced feature space F to F'; the full-space model w in M is penalized by its distance to its projection wP in the subspace MP] Yi Zhang and Jeff Schneider. Projection Penalty: Dimension Reduction without Loss. ICML 2010
Empirical results • Text classification (20 newsgroups), using logistic regression • Dimension reduction: latent Dirichlet allocation • [Figure: classification errors at 2%, 5%, and 10% training data, comparing the original feature space, the reduction alone, and the projection penalty] • Similar results on face recognition, using SVM (poly-2) • Dimension reduction: KPCA, KDA, OLaplacian Face • Similar results on house price prediction, using regression • Dimension reduction: PCA and partial least squares
Outline • Part I: Encoding Input Information by Regularization (Learning with word correlation; A matrix-normal penalty for multi-task learning [multi-task generalization]; Learn compressible models [go beyond covariance and correlation structures]; Projection penalties [encode a dimension reduction]) • Part II: Encoding Output Information by Output Codes (Composite likelihood for pairwise coding; Multi-label output codes with CCA; Maximum-margin output coding)
Multi-label classification • Existence of label dependency • Example: classify an image into scenes (desert, river, forest, etc.) • The multi-class problem is a special case: only one class is true • [Figure: learn to predict labels y1, y2, …, yq from x, with dependency among the labels]
Output coding • d < q: compression, i.e., source coding • d > q: error-correcting codes, i.e., channel coding • Use the redundancy to correct prediction ("transmission") errors • [Figure: labels y1, …, yq are encoded into a code z1, …, zd; a model learns to predict z from x, and decoding recovers y]
Error-correcting output codes (ECOCs) • Multi-class ECOCs [Dietterich & Bakiri, 1994] [Allwein, Schapire & Singer 2001] • Encode into a (redundant) set of binary problems, e.g., {y1, y2} vs. y3 or {y3, y4} vs. y7 • Learn to predict the code • Decode the predictions • Our goal: design ECOCs for multi-label classification • [Figure: x is used to predict code bits z1, …, zt; decoding maps the predicted code back to y1, …, yq]
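For orientation, a runnable example of classic multi-class ECOC using scikit-learn's OutputCodeClassifier (random binary codes, nearest-codeword decoding); it illustrates the encode/learn/decode pipeline above, not the thesis's multi-label codes:

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OutputCodeClassifier

# Each class gets a random binary codeword; one binary classifier is trained
# per code bit, and prediction decodes to the nearest class codeword.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

ecoc = OutputCodeClassifier(
    estimator=LogisticRegression(max_iter=1000),
    code_size=2.0,   # redundancy: twice as many code bits as classes
    random_state=0,
)
ecoc.fit(X_train, y_train)
print("ECOC accuracy:", ecoc.score(X_test, y_test))
```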
Outline • Part I: Encoding Input Information by Regularization (Learning with word correlation; A matrix-normal penalty for multi-task learning [multi-task generalization]; Learn compressible models [go beyond covariance and correlation structures]; Projection penalties [encode a dimension reduction]) • Part II: Encoding Output Information by Output Codes (Composite likelihood for pairwise coding; Multi-label output codes with CCA; Maximum-margin output coding)
Composite likelihood • The composite likelihood (CL): a partial specification of the likelihood as the product of simple component likelihoods • e.g., the pairwise likelihood and the full conditional likelihood (shown below) • Estimation using composite likelihoods • Computational and statistical efficiency • Robustness under model misspecification
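Written out for labels y = (y_1, …, y_q) given input x, where y_{-j} denotes all labels except y_j:

Pairwise likelihood: \mathrm{CL}(\theta) = \prod_{j < k} p\!\left(y_j, y_k \mid x;\, \theta\right)

Full conditional likelihood: \mathrm{CL}(\theta) = \prod_{j=1}^{q} p\!\left(y_j \mid y_{-j}, x;\, \theta\right)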
Multi-label problem decomposition • Problem decomposition methods • Decomposition into subproblems (encoding) • Decision making by combining subproblem predictions (decoding) • Examples: 1-vs-all, 1-vs-1, 1-vs-1 + 1-vs-all, etc. • [Figure: x is mapped through a set of subproblems to the labels y1, y2, …, yq]
1-vs-All (Binary Relevance) • Classify each label independently • The composite likelihood view: the product of per-label conditionals (written out below)
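In this view, the composite likelihood behind binary relevance is simply:

\mathrm{CL}(\theta) = \prod_{k=1}^{q} p\!\left(y_k \mid x;\, \theta_k\right)

Each component p(y_k | x; θ_k) is one independently trained 1-vs-all classifier; the product is a partial specification of the joint p(y_1, …, y_q | x) that ignores label dependency.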