
Learning from Partially Labeled Data


Presentation Transcript


  1. Learning from Partially Labeled Data Martin Szummer MIT AI lab & CBCL szummer@ai.mit.edu http://www.ai.mit.edu/people/szummer/

  2. Detecting cars

  3. Outline • The partially labeled data problem • Data representations • Markov random walk • Classification criteria • Information Regularization

  4. Learning from partially labeled data: semi-supervised learning, a setting between supervised and unsupervised learning. (Figure: spectrum from unsupervised through semi-supervised to supervised.)

  5. Semi-supervised learning from an unsupervised perspective: the labels constrain and repair the clusters. (Figure: an ambiguous clustering, resolved by including the labels.)

  6. Semi-supervised learning from a supervised perspective: the decision boundary found from labeled + unlabeled data differs from the boundary found from the labeled points alone. (Figure: unlabeled points, class 1, class -1, with the two boundaries.)

  7. Benefits of semi-supervised learning • Labeled data can be: expensive (it may require human labor and additional experiments / measurements) or impossible to obtain (labels unavailable at the present time, e.g. for prediction) • Unlabeled data can be abundant and cheap! e.g. image sequences from video cameras, text documents from the web

  8. Can we always benefit from partially labeled data? • Not always! • Assumptions required • Labeled and unlabeled data drawn IID from same distribution • Ignorable missingness mechanism • and…

  9. Key assumption • The structure in the unlabeled data must relate to the desired classification; specifically: • A link between the marginal P(x) and the conditional P(y|x), which our classifier is equipped to exploit • Marginal distribution P(x): describes the input domain • Conditional distribution P(y|x): describes the classification Example assumption: points in the same cluster should have the same label

  10. The learning task: notation • Task: classify [a subset of] the unlabeled points, training with both the labeled and unlabeled points
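
The notation itself did not survive the transcript; as a minimal sketch of the standard setup it presumably introduced (the symbols L, N, x_k, y_k below are assumptions made here, not recovered from the slide): given labeled points $(x_1, y_1), \dots, (x_L, y_L)$ with $y_k \in \{+1, -1\}$ and unlabeled points $x_{L+1}, \dots, x_N$, train on all $N$ points and predict $\hat{y}_k = \arg\max_y P(y \mid x_k)$ for the unlabeled $k = L+1, \dots, N$.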

  11. Previous approach: missing data with EM • Maximize the likelihood of a generative model that accounts for P(x) and P(x,y) • Models P(x) and P(x,y) can be mixtures of Gaussians [Miller & Uyar], or Naïve Bayes [Nigam et al] • Issues: what model? How to weight unlabeled vs. labeled data?
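
A hedged sketch of the objective such generative EM approaches typically maximize; the weight $\lambda$ is introduced here only to make the "how to weight" issue concrete and is not taken from the slide:

$$\ell(\theta) = \sum_{k=1}^{L} \log P(x_k, y_k \mid \theta) \;+\; \lambda \sum_{k=L+1}^{N} \log P(x_k \mid \theta),$$

maximized over the generative parameters $\theta$ (mixture-of-Gaussians or Naïve Bayes parameters) by treating the missing labels of the unlabeled points as latent variables in EM.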

  12. Previous approach: Large margin on unlabeled data • Transduction with SVM or MED (max entropy discrimination) • Issues: computational cost

  13. Outline • The partially labeled data problem • Data representations • Markov random walk • Classification criteria • Information Regularization

  14. Clusters and low-dimensional structures (figure: unlabeled points, labeled +1, labeled -1)

  15. Representation desiderata • The conditional should follow the data manifold; the data may lie in a low-dimensional subspace. Example: neighborhood graph • Robustly measure similarity between points: consider the volume of all paths, not just the shortest path. Example: Markov random walk • Variable resolution: adjustable cluster size or number (differentiate points at coarser scales, not at finer scales). Example: the number of time steps t of the Markov random walk determines whether two points appear indistinguishable. Construct a representation P(i|xk) that satisfies these goals.

  16. Example: Markov random walk representation (example instantiation) • Local neighborhood relation: Euclidean distance • Local transition probabilities to the K nearest neighbors • Global transition probabilities in t steps • Global representation renormalizes
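
A minimal numpy sketch of this construction, assuming the instantiation named on the slide (Euclidean distances, K nearest neighbors, t-step transitions) together with a Gaussian local weight of bandwidth sigma and a uniform prior over starting points; the latter two choices are assumptions made here, not details recovered from the slide:

# Sketch only: Markov random walk representation P(i | x_k); not the author's code.
import numpy as np
from scipy.spatial.distance import cdist

def random_walk_representation(X, K=5, sigma=1.0, t=3):
    """X: (N, d) data matrix. Returns P where column k holds P(i | x_k)."""
    N = X.shape[0]
    D = cdist(X, X)                          # pairwise Euclidean distances

    # Local weights: each point connects to itself and its K nearest neighbors.
    W = np.zeros((N, N))
    for k in range(N):
        nn = np.argsort(D[k])[:K + 1]        # includes the point itself
        W[k, nn] = np.exp(-D[k, nn] / sigma)
    W = np.maximum(W, W.T)                   # symmetrize the neighborhood graph

    # One-step transition probabilities: row-normalize the local weights.
    A = W / W.sum(axis=1, keepdims=True)

    # Global t-step transitions: [A^t]_{i,k} = P(walk ends at k | started at i).
    At = np.linalg.matrix_power(A, t)

    # Global representation: renormalize over starting points i (uniform prior).
    return At / At.sum(axis=0, keepdims=True)

With t = 0 each point is similar only to itself, and as t grows every column approaches the stationary distribution, so t acts as the resolution parameter discussed on the following slides.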

  17. Representation • Each point k is represented as a vector of (conditional) probabilities over the possible starting states i of a t-step random walk ending up in k. • Two points are similar ⇔ their random walks have indistinguishable starting points

  18. Parameter: length of random walk t • Higher t → coarser representation; fewer clusters • Limits: t = 0, ∞ (degenerate) • Choosing t based on unlabeled data alone: diameter of the graph; mixing time of the graph (2nd eigenvalue of the transition matrix) • Choosing t based on both labeled + unlabeled data: when labels are consistent over large regions → t is high; criteria: maximize likelihood, or margin, or cross-validation

  19. A Generative Model for the Labels • Given: nodes i (corresponding to points xi) and label distributions Q(y|i) at each node i. The model generates a node identity and a label: 1. Draw a node identity i uniformly and draw a label y ~ Q(y|i) 2. Add t rounds of identity noise: node i is confused with node k according to P(k|i); the label y stays intact 3. Output the final identity k and the label y • During classification: only the noisy node identity is observed, and we want to determine the label y.

  20. Classification model: given the noisy node identity k, infer the possible starting node identities i, and weight their label distributions. Question: how do we obtain Q(y|i)?
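
Written out, the weighting described here (following the generative story on the previous slide, with P(i|k) the random-walk representation):

$$P(y \mid k) = \sum_i P(i \mid k)\, Q(y \mid i), \qquad \hat{y}_k = \arg\max_y P(y \mid k),$$

so classification reduces to estimating the per-node label distributions Q(y|i), which is the question posed above.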

  21. Classification model (2) • Unlike a linear classifier, the parameters Q(y|i) are bounded, limiting the effect of outliers • The classifier is directly applicable to multiple classes • Link between P(x) and P(y|x): the smoothness of the representation

  22. Maximize the conditional log-likelihood (the conditional is taken over the labeled points only), estimated with the EM algorithm.
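
A hedged reconstruction of the EM updates this refers to, written against the representation sketch above; the smoothing constant and the fixed iteration count are assumptions, not details from the slide:

# Sketch only: EM for the conditional log-likelihood over the labeled points.
import numpy as np

def em_label_distributions(P, labeled_idx, labels, n_classes=2, n_iter=100):
    """P[i, k] = P(i | x_k); labels[j] in {0, ..., n_classes-1} is the class of
       point labeled_idx[j]. Returns Q[i, c] = Q(y = c | i)."""
    N = P.shape[0]
    Q = np.full((N, n_classes), 1.0 / n_classes)       # uniform initialization
    for _ in range(n_iter):
        # E-step: responsibility of each starting node i for each labeled point k.
        R = np.zeros((N, n_classes))
        for k, c in zip(labeled_idx, labels):
            post = P[:, k] * Q[:, c]
            R[:, c] += post / post.sum()
        # M-step: re-estimate the label distribution at every node.
        Q = (R + 1e-12) / (R + 1e-12).sum(axis=1, keepdims=True)
    return Q

The unlabeled points enter only through the representation P(i|k); the likelihood itself is conditional on the labeled points, which is what "conditional over labeled points" refers to.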

  23. Swiss roll problem (figure: unlabeled points, labeled +1, labeled -1)

  24. Swiss roll problem

  25. t=20

  26. t=10

  27. t=3

  28. Summary: Markov Random Walk representation • Points are expressed as vectors of probabilities of having been generated by every other point • Related work: clustering by Markovian relaxation [Tishby & Slonim 00]; spectral clustering [Shi & Malik 97; Meila & Shi 00; ++]; visualization: Isomap [Tenenbaum 99], locally linear embedding [Roweis & Saul 00]

  29. Outline • The partially labeled data problem • Data representations • Kernel expansion • Markov random walk • Classification criteria • conditional maximum likelihood with EM • maximize average margin • … • Information Regularization

  30. Discriminative boundaries • Focus on classification decisions more directly than maximum likelihood does • Classify the labeled points with a margin • Margin at point xk: the confidence of the classifier

  31. Margin-based estimation: maximize the average margin.

  32. The average-margin solution has a closed form: assign weight 1 to the class with the largest total "flow" to point m. This amounts to two rounds of a weighted-neighbor classifier: • Classify all points based on the labeled points • Classify all points based on the previous classification
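
A hedged reconstruction of the criterion and its closed form, in the notation used above (the slide's own formulas were lost in the transcript): the margin of a labeled point $x_k$ is $\gamma_k = P(y_k \mid k) - P(-y_k \mid k) = \sum_i P(i \mid k)\,[\,Q(y_k \mid i) - Q(-y_k \mid i)\,]$. Maximizing the average margin $\tfrac{1}{L}\sum_{k=1}^{L}\gamma_k$ over valid distributions $Q(\cdot \mid i)$ is a linear program whose optimum sits at a vertex of the simplex:

$$Q(y \mid i) = 1 \quad \text{iff} \quad y = \arg\max_{y'} \sum_{k \le L:\, y_k = y'} P(i \mid k),$$

i.e. each node i gives all of its weight to the class whose labeled points send the largest total "flow" into it.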

  33. Text classification with Markov random walks • 20 Newsgroups dataset, Mac vs. PC • 2000 examples, 7500 dimensions, averages over 20 runs

  34. Choosing t based on margin (figure: average margin per class vs. t, for the Mac and Win classes). Choose t to maximize the average margin on labeled and unlabeled points.

  35. Car Detection 2500 road scene images; split evenly between cars and non-cars

  36. Haar wavelet features

  37. Markov random walk with 1 step (t=1) (figure: error rate vs. number of labeled examples, log scale, for NU = 0, 256, 512, 1024 unlabeled examples)

  38. Markov random walk with 1 step (t=1) (figure: error rate vs. number of unlabeled examples, log scale, for NL = 16, 32, 64, 128 labeled examples)

  39. Markov random walk (t=5) (figure: error rate vs. number of labeled examples, log scale, for NU = 256, 512, 1024 unlabeled examples)

  40. Markov random walk (t=5), varying unlabeled (figure: error rate vs. number of unlabeled examples, log scale, for NL = 16, 32, 64, 128 labeled examples)

  41. Outline • The partially labeled data problem • Data representations • Kernel expansion • Markov random walk • Classification criteria • Information Regularization

  42. Information Regularization overview • Markov random walk: linked P(x) to P(y|x) indirectly, through the classification model • Information Regularization: explicitly and directly links P(x) to P(y|x), and makes no parametric assumptions about the link

  43. Assumption: inside small regions with a large number of points, the labeling should not change. Regularization approach: cover the domain with small regions, and penalize inhomogeneous labelings in the regions.

  44. Mutual information • Mutual information I(x; y) over a region • I(x; y) = how many bits of information knowledge about x contributes to knowledge about y, on average • I(x; y) = H(y) – H(y|x), a function of P(x) and P(y|x) • a measure of the homogeneity of the labels
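
For reference, the quantity being used (a standard identity, not additional content from the slide):

$$I(x; y) = H(y) - H(y \mid x) = \sum_{x, y} P(x, y) \log \frac{P(x, y)}{P(x)\,P(y)},$$

which is zero exactly when the label y is independent of the location x within the region, i.e. when the labeling is homogeneous.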

  45. Mutual Information – a homogeneity measure • Example: x = location within the circle; y = {+, –} • Permutation invariant in both x and y

  46. Information Regularization (in a small region) • Penalize the weighted mutual information over a small region Q in the input domain • MQ = probability mass of x in region Q (high-density region → penalize more) • VQ = variance of x in region Q • IQ / VQ is independent of the size of Q as Q shrinks
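
Putting the quantities on this slide together (a hedged reconstruction; the slide's own formula was lost in the transcript), the penalty attached to a small region Q is

$$\text{penalty}(Q) = M_Q \, \frac{I_Q(x; y)}{V_Q},$$

where $I_Q(x; y)$ is the mutual information computed with $P(x)$ restricted to Q and renormalized. The factor $M_Q$ penalizes high-density regions more, and dividing by $V_Q$ is what makes the ratio insensitive to the size of Q as Q shrinks, as stated above.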

  47. Information Regularization (whole domain) • Cover the domain with small overlapping regions • Regularize each region • The cover should be connected • Example cover: balls centered at each data point • Trade-off: smaller regions vs. more overlap (small regions: preserve spatial locality; overlap: consistent regularization across regions)

  48. Minimize Max Information Content • Minimize the maximum information contained in any region Q in the cover
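
Combined with the regional penalty above, a hedged sketch of the overall estimation problem (the form of the labeled-data constraint is an assumption made here; it is not spelled out on the slide):

$$\min_{P(y \mid x)} \; \max_{Q \in \text{cover}} \; M_Q \, \frac{I_Q(x; y)}{V_Q} \qquad \text{subject to } P(y_k \mid x_k) \text{ fitting the observed labels, } k = 1, \dots, L.$$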
