
Loss-based Learning with Latent Variables


Presentation Transcript


  1. Loss-based Learning with Latent Variables. M. Pawan Kumar, École Centrale Paris, École des Ponts ParisTech, INRIA Saclay, Île-de-France. Joint work with Ben Packer, Daphne Koller, Kevin Miller and Danny Goodman.

  2. Image Classification. Images xi, boxes hi, labels yi; a new image x with label y = “Deer”. Classes: Bison, Deer, Elephant, Giraffe, Llama, Rhino.

  3. Image Classification. Feature Ψ(x,y,h) (e.g. HOG). Score f : Ψ(x,y,h) → (-∞, +∞). [Figure: f(Ψ(x,y1,h))]

  4. Image Classification. Feature Ψ(x,y,h) (e.g. HOG). Score f : Ψ(x,y,h) → (-∞, +∞). Prediction y(f). Learn f. [Figure: f(Ψ(x,y1,h)), f(Ψ(x,y2,h))]

  5. Loss-based Learning. User-defined loss function Δ(y,y(f)). f* = argmin_f Σi Δ(yi, yi(f)). Minimize the loss between the predicted and ground-truth output. No restriction on the loss function. General framework (object detection, segmentation, …).
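
To make the objective concrete, here is a minimal Python sketch (not part of the original slides) of the empirical loss Σi Δ(yi, yi(f)) with a 0/1 loss as one possible user-defined choice; `predict` stands in for any hypothetical prediction rule y(f).

```python
# Minimal sketch of the loss-based objective on slide 5: f* = argmin_f sum_i Delta(y_i, y_i(f)).
# The 0/1 loss is only one possible choice; the framework places no restriction on Delta.

def zero_one_loss(y_true, y_pred):
    """User-defined loss Delta(y, y(f)); swap in any task-specific loss."""
    return 0.0 if y_true == y_pred else 1.0

def empirical_loss(predict, data, loss=zero_one_loss):
    """Sum of Delta(y_i, y_i(f)) over the training set {(x_i, y_i)};
    learning searches for the predictor that minimises this sum."""
    return sum(loss(y_i, predict(x_i)) for x_i, y_i in data)
```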

  6. Outline • Latent SVM • Max-Margin Min-Entropy Models • Dissimilarity Coefficient Learning Andrews et al., NIPS 2001; Smola et al., AISTATS 2005; Felzenszwalb et al., CVPR 2008; Yu and Joachims, ICML 2009

  7. Image Classification. Images xi, boxes hi, labels yi; a new image x with label y = “Deer”. Classes: Bison, Deer, Elephant, Giraffe, Llama, Rhino.

  8. Latent SVM. Scoring function: wTΨ(x,y,h), with parameters w and features Ψ(x,y,h). Prediction: (y(w), h(w)) = argmax_{y,h} wTΨ(x,y,h).
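
A minimal sketch of this prediction rule, assuming the label set and the candidate boxes are small enough to enumerate; `psi` is a hypothetical feature map (e.g. HOG features pooled over box h for class y) returning a NumPy vector.

```python
import numpy as np

def predict_latent_svm(w, x, labels, boxes, psi):
    """(y(w), h(w)) = argmax_{y,h} w^T Psi(x, y, h), by brute-force enumeration."""
    best_yh, best_score = None, -np.inf
    for y in labels:
        for h in boxes:
            score = float(w @ psi(x, y, h))
            if score > best_score:
                best_yh, best_score = (y, h), score
    return best_yh, best_score
```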

  9. Learning Latent SVM. Training data {(xi,yi), i = 1,2,…,n}. w* = argmin_w Σi Δ(yi, yi(w)). Highly non-convex in w. Cannot regularize w to prevent overfitting.

  10. Learning Latent SVM. Training data {(xi,yi), i = 1,2,…,n}. Δ(yi, yi(w)) = wTΨ(xi, yi(w), hi(w)) + Δ(yi, yi(w)) − wTΨ(xi, yi(w), hi(w)) ≤ wTΨ(xi, yi(w), hi(w)) + Δ(yi, yi(w)) − max_hi wTΨ(xi, yi, hi) ≤ max_{y,h} {wTΨ(xi, y, h) + Δ(yi, y)} − max_hi wTΨ(xi, yi, hi). The first inequality holds because (yi(w), hi(w)) maximizes the score, so wTΨ(xi, yi(w), hi(w)) ≥ max_hi wTΨ(xi, yi, hi); the second replaces the predicted output by a maximization over all outputs.

  11. Learning Latent SVM. Training data {(xi,yi), i = 1,2,…,n}. min_w ||w||2 + C Σi ξi, subject to wTΨ(xi,y,h) + Δ(yi,y) − max_hi wTΨ(xi,yi,hi) ≤ ξi for all y, h. Difference-of-convex program in w. Local minimum or saddle point solution (CCCP). Self-Paced Learning, NIPS 2010.
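
A minimal sketch, under the same enumerability assumption, of the CCCP-style alternation commonly used for this difference-of-convex program: impute the latent variables with the current w (the concave part), then take subgradient steps on the resulting convex, structural-SVM-like upper bound. `psi` (returning NumPy feature vectors), `delta`, `labels` and `boxes` are hypothetical problem-specific inputs.

```python
import numpy as np

def cccp_latent_svm(psi, delta, data, labels, boxes, dim,
                    C=1.0, outer_iters=10, inner_iters=50, lr=1e-3):
    """Approximately solve min_w ||w||^2 + C sum_i xi_i from slide 11."""
    w = np.zeros(dim)
    for _ in range(outer_iters):
        # Concave step: impute h_i* = argmax_h w^T Psi(x_i, y_i, h) for each sample.
        imputed = [max(boxes, key=lambda h, x=x, y=y: float(w @ psi(x, y, h)))
                   for x, y in data]
        # Convex step: subgradient descent on the upper bound
        # ||w||^2 + C sum_i [ max_{y,h} (w^T Psi(x_i,y,h) + Delta(y_i,y)) - w^T Psi(x_i,y_i,h_i*) ].
        for _ in range(inner_iters):
            grad = 2.0 * w
            for (x, y_true), h_star in zip(data, imputed):
                y_hat, h_hat = max(((y, h) for y in labels for h in boxes),
                                   key=lambda yh: delta(y_true, yh[0])
                                   + float(w @ psi(x, yh[0], yh[1])))
                grad = grad + C * (psi(x, y_hat, h_hat) - psi(x, y_true, h_star))
            w = w - lr * grad
    return w
```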

  12. Recap. Scoring function: wTΨ(x,y,h). Prediction: (y(w), h(w)) = argmax_{y,h} wTΨ(x,y,h). Learning: min_w ||w||2 + C Σi ξi, subject to wTΨ(xi,y,h) + Δ(yi,y) − max_hi wTΨ(xi,yi,hi) ≤ ξi for all y, h.

  13. Image Classification. Images xi, boxes hi, labels yi; a new image x with label y = “Deer”. Classes: Bison, Deer, Elephant, Giraffe, Llama, Rhino.

  14. Image Classification. Score wTΨ(x,y,h) ∈ (-∞, +∞). [Figure: wTΨ(x,y1,h)]

  15. Image Classification. Score wTΨ(x,y,h) ∈ (-∞, +∞). Only the maximum score is used for prediction. Is there no other useful cue? Uncertainty in h. [Figure: wTΨ(x,y1,h), wTΨ(x,y2,h)]

  16. Outline • Latent SVM • Max-Margin Min-Entropy (M3E) Models • Dissimilarity Coefficient Learning Miller, Kumar, Packer, Goodman and Koller, AISTATS 2012

  17. M3E. Scoring function: Pw(y,h|x) = exp(wTΨ(x,y,h)) / Z(x), where Z(x) is the partition function. Prediction: y(w) = argmin_y { Hα(Pw(h|y,x)) − log Pw(y|x) } = argmin_y Gα(y;x,w), where Hα is the Rényi entropy, Pw(y|x) is the marginalized probability, and Gα is the Rényi entropy of the generalized distribution.

  18. Rényi Entropy. Gα(y;x,w) = (1/(1−α)) log [ Σh Pw(y,h|x)^α / Σh Pw(y,h|x) ]. α = 1: Shannon entropy of the generalized distribution, − Σh Pw(y,h|x) log Pw(y,h|x) / Σh Pw(y,h|x).

  19. Rényi Entropy. Gα(y;x,w) = (1/(1−α)) log [ Σh Pw(y,h|x)^α / Σh Pw(y,h|x) ]. α = ∞: minimum entropy of the generalized distribution, − max_h log Pw(y,h|x).

  20. Rényi Entropy. Gα(y;x,w) = (1/(1−α)) log [ Σh Pw(y,h|x)^α / Σh Pw(y,h|x) ]. α = ∞: minimum entropy of the generalized distribution, − max_h wTΨ(x,y,h) (up to the constant log Z(x)). Same prediction as latent SVM.
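
A minimal sketch of Gα computed from the joint scores s_h = wTΨ(x,y,h) for a fixed label y (so Pw(y,h|x) ∝ exp(s_h)); the constant log Z(x) is dropped since it does not change the argmin over y. The α = 1 and α = ∞ branches follow slides 18-20; the helper names are my own.

```python
import numpy as np

def _logsumexp(a):
    a = np.asarray(a, dtype=float)
    m = a.max()
    return m + np.log(np.exp(a - m).sum())

def renyi_generalized(scores, alpha):
    """G_alpha(y; x, w) up to the additive constant log Z(x), from the scores
    s_h = w^T Psi(x, y, h):  (logsumexp(alpha*s) - logsumexp(s)) / (1 - alpha)."""
    s = np.asarray(scores, dtype=float)
    if np.isinf(alpha):                   # alpha -> infinity: minimum entropy, -max_h s_h
        return -s.max()                   # (same ranking as the latent SVM score, slide 20)
    if alpha == 1.0:                      # alpha = 1: Shannon entropy of the generalized
        q = np.exp(s - _logsumexp(s))     # distribution (slide 18), taken as a limit
        return -float((q * s).sum())
    return (_logsumexp(alpha * s) - _logsumexp(s)) / (1.0 - alpha)

def predict_m3e(w, x, labels, boxes, psi, alpha):
    """y(w) = argmin_y G_alpha(y; x, w) (slide 17), enumerating boxes per label."""
    return min(labels,
               key=lambda y: renyi_generalized([float(w @ psi(x, y, h)) for h in boxes], alpha))
```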

  21. Learning M3E. Training data {(xi,yi), i = 1,2,…,n}. w* = argmin_w Σi Δ(yi, yi(w)). Highly non-convex in w. Cannot regularize w to prevent overfitting.

  22. Learning M3E. Training data {(xi,yi), i = 1,2,…,n}. Δ(yi, yi(w)) = Gα(yi(w);xi,w) + Δ(yi, yi(w)) − Gα(yi(w);xi,w) ≤ Gα(yi;xi,w) + Δ(yi, yi(w)) − Gα(yi(w);xi,w) ≤ Gα(yi;xi,w) + max_y {Δ(yi,y) − Gα(y;xi,w)}. The first inequality holds because yi(w) minimizes Gα(y;xi,w); the second replaces the predicted label by a maximization over all labels.

  23. Learning M3E. Training data {(xi,yi), i = 1,2,…,n}. min_w ||w||2 + C Σi ξi, subject to Gα(yi;xi,w) + Δ(yi,y) − Gα(y;xi,w) ≤ ξi for all y. When α tends to infinity, M3E = Latent SVM. Other values can give better results.
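
As a follow-up, a minimal sketch of the per-sample slack implied by these constraints, ξi = Gα(yi;xi,w) + max_y {Δ(yi,y) − Gα(y;xi,w)}, reusing the hypothetical renyi_generalized and psi helpers from the sketch after slide 20.

```python
def m3e_slack(w, x, y_true, labels, boxes, psi, delta, alpha):
    """Slack xi_i for one training sample; non-negative because taking y = y_true
    inside the max cancels g(y_true), and it upper-bounds Delta(y_i, y_i(w)) (slide 22)."""
    def g(y):
        return renyi_generalized([float(w @ psi(x, y, h)) for h in boxes], alpha)
    return g(y_true) + max(delta(y_true, y) - g(y) for y in labels)
```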

  24. Image Classification. Mammals Dataset: 271 images, 6 classes, 90/10 train/test split, 5 folds, 0/1 loss.

  25. Image Classification HOG-Based Model. Dalal and Triggs, 2005

  26. Motif Finding. UniProbe Dataset: ~40,000 sequences, binding vs. not-binding, 50/50 train/test split, 5 proteins, 5 folds.

  27. Motif Finding Motif + Markov Background Model. Yu and Joachims, 2009

  28. Recap. Scoring function: Pw(y,h|x) = exp(wTΨ(x,y,h)) / Z(x). Prediction: y(w) = argmin_y Gα(y;x,w). Learning: min_w ||w||2 + C Σi ξi, subject to Gα(yi;xi,w) + Δ(yi,y) − Gα(y;xi,w) ≤ ξi for all y.

  29. Outline • Latent SVM • Max-Margin Min-Entropy Models • Dissimilarity Coefficient Learning Kumar, Packer and Koller, ICML 2012

  30. Object Detection. Images xi, boxes hi, labels yi; a new image x with label y = “Deer”. Classes: Bison, Deer, Elephant, Giraffe, Llama, Rhino.

  31. Minimizing General Loss. Supervised samples: min_w Σi Δ(yi,hi,yi(w),hi(w)). Weakly supervised samples: + Σi Δ’(yi,yi(w),hi(w)). Unknown latent variable values.

  32. Minimizing General Loss. min_w Σi Σhi Δ(yi,hi,yi(w),hi(w)) Pw(hi|xi,yi). A single distribution to achieve two objectives.

  33. Problem. The same distribution must both model the uncertainty in the latent variables and model the accuracy of the latent variable predictions.

  34. Solution. Use two different distributions for the two different tasks: model the uncertainty in the latent variables, and model the accuracy of the latent variable predictions.

  35. Solution. Use two different distributions for the two different tasks. Pθ(hi|yi,xi), a distribution over hi, models the uncertainty in the latent variables; the accuracy of the latent variable predictions still has to be modeled. [Figure: Pθ(hi|yi,xi) over hi]

  36. Solution. Use two different distributions for the two different tasks. Pθ(hi|yi,xi) models the uncertainty in the latent variables; Pw(yi,hi|xi) models the accuracy of the latent variable predictions. [Figure: Pθ(hi|yi,xi) over hi; Pw(yi,hi|xi) over (y,h), with (yi,hi) and (yi(w),hi(w)) marked]

  37. The Ideal Case. No latent variable uncertainty, correct prediction. [Figure: Pθ(hi|yi,xi) over hi, with hi(w) marked; Pw(yi,hi|xi) over (y,h), with (yi,hi(w)) and (yi,hi) marked]

  38. In Practice. Restrictions in the representation power of the models. [Figure: Pθ(hi|yi,xi) over hi; Pw(yi,hi|xi) over (y,h), with (yi,hi) and (yi(w),hi(w)) marked]

  39. Our Framework. Minimize the dissimilarity between the two distributions, using a user-defined dissimilarity measure. [Figure: Pθ(hi|yi,xi); Pw(yi,hi|xi)]

  40. Our Framework. Minimize Rao’s Dissimilarity Coefficient: Σh Δ(yi,h,yi(w),hi(w)) Pθ(h|yi,xi) … [Figure: Pθ(hi|yi,xi); Pw(yi,hi|xi)]

  41. Our Framework. Minimize Rao’s Dissimilarity Coefficient: Hi(w,θ) − β Σh,h’ Δ(yi,h,yi,h’) Pθ(h|yi,xi) Pθ(h’|yi,xi) … [Figure: Pθ(hi|yi,xi); Pw(yi,hi|xi)]

  42. Our Framework. Minimize Rao’s Dissimilarity Coefficient: Hi(w,θ) − β Hi(θ,θ) − (1−β) Δ(yi(w),hi(w),yi(w),hi(w)), where the last term is zero because the loss between identical outputs vanishes. [Figure: Pθ(hi|yi,xi); Pw(yi,hi|xi)]

  43. Our Framework. Minimize Rao’s Dissimilarity Coefficient: min_{w,θ} Σi Hi(w,θ) − β Hi(θ,θ). [Figure: Pθ(hi|yi,xi); Pw(yi,hi|xi)]

  44. Optimization. min_{w,θ} Σi Hi(w,θ) − β Hi(θ,θ). Initialize the parameters to w0 and θ0. Repeat until convergence: fix w and optimize θ; fix θ and optimize w.
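
A minimal sketch of this alternating scheme; `optimize_theta` and `optimize_w` are hypothetical placeholders for the two inner solvers (the slides use stochastic subgradient descent for θ and a CCCP-based, latent-SVM-like solver for w), each assumed to return the updated parameters and the current objective value.

```python
def alternate_minimization(w0, theta0, optimize_theta, optimize_w,
                           max_iters=20, tol=1e-4):
    """min_{w,theta} sum_i H_i(w,theta) - beta * H_i(theta,theta) by coordinate descent."""
    w, theta = w0, theta0
    prev_obj = float("inf")
    for _ in range(max_iters):
        theta, _ = optimize_theta(w, theta)   # fix w, optimise theta
        w, obj = optimize_w(w, theta)         # fix theta, optimise w
        if abs(prev_obj - obj) < tol:         # declare convergence when the objective stalls
            break
        prev_obj = obj
    return w, theta
```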

  45. Optimization of θ. min_θ Σi Σh Δ(yi,h,yi(w),hi(w)) Pθ(h|yi,xi) − β Hi(θ,θ). Case I: yi(w) = yi. [Figure: Pθ(hi|yi,xi) over hi, with hi(w) marked]

  46. Optimization of θ. min_θ Σi Σh Δ(yi,h,yi(w),hi(w)) Pθ(h|yi,xi) − β Hi(θ,θ). Case I: yi(w) = yi. [Figure: Pθ(hi|yi,xi) over hi, with hi(w) marked]

  47. Optimization of θ. min_θ Σi Σh Δ(yi,h,yi(w),hi(w)) Pθ(h|yi,xi) − β Hi(θ,θ). Case II: yi(w) ≠ yi. [Figure: Pθ(hi|yi,xi) over hi]

  48. Optimization of θ. min_θ Σi Σh Δ(yi,h,yi(w),hi(w)) Pθ(h|yi,xi) − β Hi(θ,θ). Stochastic subgradient descent. Case II: yi(w) ≠ yi. [Figure: Pθ(hi|yi,xi) over hi]

  49. Optimization of w. min_w Σi Σh Δ(yi,h,yi(w),hi(w)) Pθ(h|yi,xi). Expected loss, models uncertainty. Form of optimization similar to latent SVM: Concave-Convex Procedure (CCCP). If Δ is independent of h, this reduces to the latent SVM.
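
A minimal sketch of evaluating this w-step objective, the loss expected under Pθ(h|yi,xi); `predict`, `p_theta` and `delta` are hypothetical problem-specific callables, and the latent space is assumed small enough to enumerate.

```python
def expected_loss(w, theta, data, boxes, predict, p_theta, delta):
    """sum_i sum_h Delta(y_i, h, y_i(w), h_i(w)) * P_theta(h | y_i, x_i)."""
    total = 0.0
    for x, y_true in data:
        y_pred, h_pred = predict(w, x)        # (y_i(w), h_i(w)) from the current model
        total += sum(delta(y_true, h, y_pred, h_pred) * p_theta(h, y_true, x, theta)
                     for h in boxes)
    return total
```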

  50. Object Detection. Train: input xi, output yi. Test: input x, output y = “Deer”, latent variable h. Classes: Bison, Deer, Elephant, Giraffe, Llama, Rhino. Mammals Dataset, 60/40 train/test split, 5 folds.
