
Learning from Partially Labeled Data


Presentation Transcript


  1. Learning from Partially Labeled Data Martin Szummer MIT AI lab & CBCL szummer@ai.mit.edu http://www.ai.mit.edu/people/szummer/

  2. Detecting cars

  3. Outline • The partially labeled data problem • Data representations • Markov random walk • Classification criteria • Information Regularization

  4. Learning from partially labeled data: semi-supervised learning, a setting between supervised and unsupervised learning. (Figure: spectrum from unsupervised through semi-supervised to supervised.)

  5. Semi-supervised learning from an unsupervised perspective: the labels constrain and repair the clusters. (Figure: an ambiguous clustering, resolved by including the labels.)

  6. Semi-supervised learning from a supervised perspective: the decision boundary found from labeled + unlabeled data differs from the boundary found from the labeled points alone. (Figure: unlabeled points, class 1, class -1, with the two boundaries.)

  7. Benefits of semi-supervised learning • Labeled data can be: expensive (it may require human labor and additional experiments / measurements) or impossible to obtain (labels unavailable at the present time, e.g. for prediction) • Unlabeled data can be abundant and cheap! e.g. image sequences from video cameras, text documents from the web

  8. Can we always benefit from partially labeled data? • Not always! • Assumptions required • Labeled and unlabeled data drawn IID from same distribution • Ignorable missingness mechanism • and…

  9. Key assumption • The structure in the unlabeled data must relate to the desired classification; specifically: • A link between the marginal P(x) and the conditional P(y|x), which our classifier is equipped to exploit • Marginal distribution P(x): describes the input domain • Conditional distribution P(y|x): describes the classification Example assumption: points in the same cluster should have the same label

  10. The learning task: notation • Task: classify [a subset of] the unlabeled points, training with both the labeled and unlabeled points
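
The notation itself did not survive the transcript; as a minimal sketch of the standard setup it presumably introduced (the symbols L, N, x_k, y_k below are assumptions made here, not recovered from the slide): given labeled points $(x_1, y_1), \dots, (x_L, y_L)$ with $y_k \in \{+1, -1\}$ and unlabeled points $x_{L+1}, \dots, x_N$, train on all $N$ points and predict $\hat{y}_k = \arg\max_y P(y \mid x_k)$ for the unlabeled $k = L+1, \dots, N$.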

  11. Previous approach: missing data with EM • Maximize the likelihood of a generative model that accounts for P(x) and P(x,y) • Models P(x) and P(x,y) can be mixtures of Gaussians [Miller & Uyar], or Naïve Bayes [Nigam et al] • Issues: what model? How to weight unlabeled vs. labeled data?
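
A hedged sketch of the objective such generative EM approaches typically maximize; the weight $\lambda$ is introduced here only to make the "how to weight" issue concrete and is not taken from the slide:

$$\ell(\theta) = \sum_{k=1}^{L} \log P(x_k, y_k \mid \theta) \;+\; \lambda \sum_{k=L+1}^{N} \log P(x_k \mid \theta),$$

maximized over the generative parameters $\theta$ (mixture-of-Gaussians or Naïve Bayes parameters) by treating the missing labels of the unlabeled points as latent variables in EM.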

  12. Previous approach: Large margin on unlabeled data • Transduction with SVM or MED (max entropy discrimination) • Issues: computational cost

  13. Outline • The partially labeled data problem • Data representations • Markov random walk • Classification criteria • Information Regularization

  14. Clusters and low-dimensional structures (figure: unlabeled points, labeled +1, labeled -1)

  15. Representation desiderata • The conditional should follow the data manifold; the data may lie in a low-dimensional subspace. Example: neighborhood graph • Robustly measure similarity between points: consider the volume of all paths, not just the shortest path. Example: Markov random walk • Variable resolution: adjustable cluster size or number (differentiate points at coarser scales, not at finer scales). Example: the number of time steps t of the Markov random walk determines whether two points appear indistinguishable. Construct a representation P(i|xk) that satisfies these goals.

  16. Example: Markov random walk representation (example instantiation) • Local neighborhood relation: Euclidean distance • Local transition probabilities to the K nearest neighbors • Global transition probabilities in t steps • Global representation renormalizes
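
A minimal numpy sketch of this construction, assuming the instantiation named on the slide (Euclidean distances, K nearest neighbors, t-step transitions) together with a Gaussian local weight of bandwidth sigma and a uniform prior over starting points; the latter two choices are assumptions made here, not details recovered from the slide:

# Sketch only: Markov random walk representation P(i | x_k); not the author's code.
import numpy as np
from scipy.spatial.distance import cdist

def random_walk_representation(X, K=5, sigma=1.0, t=3):
    """X: (N, d) data matrix. Returns P where column k holds P(i | x_k)."""
    N = X.shape[0]
    D = cdist(X, X)                          # pairwise Euclidean distances

    # Local weights: each point connects to itself and its K nearest neighbors.
    W = np.zeros((N, N))
    for k in range(N):
        nn = np.argsort(D[k])[:K + 1]        # includes the point itself
        W[k, nn] = np.exp(-D[k, nn] / sigma)
    W = np.maximum(W, W.T)                   # symmetrize the neighborhood graph

    # One-step transition probabilities: row-normalize the local weights.
    A = W / W.sum(axis=1, keepdims=True)

    # Global t-step transitions: [A^t]_{i,k} = P(walk ends at k | started at i).
    At = np.linalg.matrix_power(A, t)

    # Global representation: renormalize over starting points i (uniform prior).
    return At / At.sum(axis=0, keepdims=True)

With t = 0 each point is similar only to itself, and as t grows every column approaches the stationary distribution, so t acts as the resolution parameter discussed on the following slides.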

  17. Representation • Each point k is represented as a vector of (conditional) probabilities over the possible starting states i of a t-step random walk ending up in k. • Two points are similar ⇔ their random walks have indistinguishable starting points

  18. Parameter: length of random walk t • Higher t → coarser representation; fewer clusters • Limits: t = 0, ∞ (degenerate) • Choosing t based on unlabeled data alone: diameter of the graph; mixing time of the graph (2nd eigenvalue of the transition matrix) • Choosing t based on both labeled + unlabeled data: when labels are consistent over large regions → t is high; criteria: maximize likelihood, or margin, or cross-validation

  19. A Generative Model for the Labels • Given: nodes i (corresponding to points xi) and label distributions Q(y|i) at each node i. The model generates a node identity and a label: 1. Draw a node identity i uniformly and draw a label y ~ Q(y|i) 2. Add t rounds of identity noise: node i is confused with node k according to P(k|i); the label y stays intact 3. Output the final identity k and the label y • During classification: only the noisy node identity is observed, and we want to determine the label y.

  20. Classification model: given the noisy node identity k, infer the possible starting node identities i, and weight their label distributions. Question: how do we obtain Q(y|i)?
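
Written out, the weighting described here (following the generative story on the previous slide, with P(i|k) the random-walk representation):

$$P(y \mid k) = \sum_i P(i \mid k)\, Q(y \mid i), \qquad \hat{y}_k = \arg\max_y P(y \mid k),$$

so classification reduces to estimating the per-node label distributions Q(y|i), which is the question posed above.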

  21. Classification model (2) • Unlike a linear classifier, the parameters Q(y|i) are bounded, limiting the effect of outliers • The classifier is directly applicable to multiple classes • Link between P(x) and P(y|x): the smoothness of the representation

  22. Maximize the conditional log-likelihood (the conditional is taken over the labeled points only), estimated with the EM algorithm.
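
A hedged reconstruction of the EM updates this refers to, written against the representation sketch above; the smoothing constant and the fixed iteration count are assumptions, not details from the slide:

# Sketch only: EM for the conditional log-likelihood over the labeled points.
import numpy as np

def em_label_distributions(P, labeled_idx, labels, n_classes=2, n_iter=100):
    """P[i, k] = P(i | x_k); labels[j] in {0, ..., n_classes-1} is the class of
       point labeled_idx[j]. Returns Q[i, c] = Q(y = c | i)."""
    N = P.shape[0]
    Q = np.full((N, n_classes), 1.0 / n_classes)       # uniform initialization
    for _ in range(n_iter):
        # E-step: responsibility of each starting node i for each labeled point k.
        R = np.zeros((N, n_classes))
        for k, c in zip(labeled_idx, labels):
            post = P[:, k] * Q[:, c]
            R[:, c] += post / post.sum()
        # M-step: re-estimate the label distribution at every node.
        Q = (R + 1e-12) / (R + 1e-12).sum(axis=1, keepdims=True)
    return Q

The unlabeled points enter only through the representation P(i|k); the likelihood itself is conditional on the labeled points, which is what "conditional over labeled points" refers to.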

  23. Swiss roll problem (figure: unlabeled points, labeled +1, labeled -1)

  24. Swiss roll problem

  25. t=20

  26. t=10

  27. t=3

  28. Summary: Markov Random Walk representation • Points are expressed as vectors of probabilities of having been generated by every other point • Related work: clustering by Markovian relaxation [Tishby & Slonim 00]; spectral clustering [Shi & Malik 97; Meila & Shi 00; ++]; visualization: Isomap [Tenenbaum 99], locally linear embedding [Roweis & Saul 00]

  29. Outline • The partially labeled data problem • Data representations • Kernel expansion • Markov random walk • Classification criteria • conditional maximum likelihood with EM • maximize average margin • … • Information Regularization

  30. Discriminative boundaries • Focus on classification decisions more directly than maximum likelihood does • Classify the labeled points with a margin • Margin at point xk: the confidence of the classifier

  31. Margin-based estimation: maximize the average margin.

  32. The average-margin solution has a closed form: assign weight 1 to the class with the largest total "flow" to point m. This amounts to two rounds of a weighted-neighbor classifier: • Classify all points based on the labeled points • Classify all points based on the previous classification
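
A hedged reconstruction of the criterion and its closed form, in the notation used above (the slide's own formulas were lost in the transcript): the margin of a labeled point $x_k$ is $\gamma_k = P(y_k \mid k) - P(-y_k \mid k) = \sum_i P(i \mid k)\,[\,Q(y_k \mid i) - Q(-y_k \mid i)\,]$. Maximizing the average margin $\tfrac{1}{L}\sum_{k=1}^{L}\gamma_k$ over valid distributions $Q(\cdot \mid i)$ is a linear program whose optimum sits at a vertex of the simplex:

$$Q(y \mid i) = 1 \quad \text{iff} \quad y = \arg\max_{y'} \sum_{k \le L:\, y_k = y'} P(i \mid k),$$

i.e. each node i gives all of its weight to the class whose labeled points send the largest total "flow" into it.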

  33. Text classification with Markov random walks • 20 Newsgroups dataset, Mac vs. PC • 2000 examples, 7500 dimensions, averages over 20 runs

  34. Choosing t based on margin (figure: average margin per class vs. t, for the Mac and Win classes). Choose t to maximize the average margin on labeled and unlabeled points.

  35. Car Detection 2500 road scene images; split evenly between cars and non-cars

  36. Haar wavelet features

  37. Markov random walk with 1 step (t=1) (figure: error rate vs. number of labeled examples, log scale, for NU = 0, 256, 512, 1024 unlabeled examples)

  38. Markov random walk with 1 step (t=1) (figure: error rate vs. number of unlabeled examples, log scale, for NL = 16, 32, 64, 128 labeled examples)

  39. Markov random walk (t=5) (figure: error rate vs. number of labeled examples, log scale, for NU = 256, 512, 1024 unlabeled examples)

  40. Markov random walk (t=5), varying unlabeled (figure: error rate vs. number of unlabeled examples, log scale, for NL = 16, 32, 64, 128 labeled examples)

  41. Outline • The partially labeled data problem • Data representations • Kernel expansion • Markov random walk • Classification criteria • Information Regularization

  42. Information Regularization overview • Markov random walk: linked P(x) to P(y|x) indirectly, through the classification model • Information Regularization: explicitly and directly links P(x) to P(y|x), and makes no parametric assumptions about the link

  43. Assumption: inside small regions with a large number of points, the labeling should not change. Regularization approach: cover the domain with small regions, and penalize inhomogeneous labelings in the regions.

  44. Mutual information • Mutual information I(x; y) over a region • I(x; y) = how many bits of information knowledge about x contributes to knowledge about y, on average • I(x; y) = H(y) – H(y|x), a function of P(x) and P(y|x) • a measure of the homogeneity of the labels
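
For reference, the quantity being used (a standard identity, not additional content from the slide):

$$I(x; y) = H(y) - H(y \mid x) = \sum_{x, y} P(x, y) \log \frac{P(x, y)}{P(x)\,P(y)},$$

which is zero exactly when the label y is independent of the location x within the region, i.e. when the labeling is homogeneous.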

  45. Mutual Information – a homogeneity measure • Example: x = location within the circle; y = {+, –} • Permutation invariant in both x and y

  46. Information Regularization (in a small region) • Penalize the weighted mutual information over a small region Q in the input domain • MQ = probability mass of x in region Q (high-density region → penalize more) • VQ = variance of x in region Q • IQ / VQ is independent of the size of Q as Q shrinks
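
Putting the quantities on this slide together (a hedged reconstruction; the slide's own formula was lost in the transcript), the penalty attached to a small region Q is

$$\text{penalty}(Q) = M_Q \, \frac{I_Q(x; y)}{V_Q},$$

where $I_Q(x; y)$ is the mutual information computed with $P(x)$ restricted to Q and renormalized. The factor $M_Q$ penalizes high-density regions more, and dividing by $V_Q$ is what makes the ratio insensitive to the size of Q as Q shrinks, as stated above.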

  47. Information Regularization (whole domain) • Cover the domain with small overlapping regions • Regularize each region • The cover should be connected • Example cover: balls centered at each data point • Trade-off: smaller regions vs. more overlap (small regions: preserve spatial locality; overlap: consistent regularization across regions)

  48. Minimize Max Information Content • Minimize the maximum information contained in any region Q in the cover
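
Combined with the regional penalty above, a hedged sketch of the overall estimation problem (the form of the labeled-data constraint is an assumption made here; it is not spelled out on the slide):

$$\min_{P(y \mid x)} \; \max_{Q \in \text{cover}} \; M_Q \, \frac{I_Q(x; y)}{V_Q} \qquad \text{subject to } P(y_k \mid x_k) \text{ fitting the observed labels, } k = 1, \dots, L.$$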
