Part 2: part-based models

Part 2: part-based models by Rob Fergus (MIT)

Problem with bag-of-words • All have equal probability for bag-of-words methods • Location information is important

Overview of section • Representation • Computational complexity • Location • Appearance • Occlusion, Background clutter • Recognition • Demos

Representation

Model: Parts and Structure

Representation • Object as set of parts • Generative representation • Model: • Relative locations between parts • Appearance of part • Issues: • How to model location • How to represent appearance • Sparse or dense (pixels or regions) • How to handle occlusion/clutter Figure from [Fischler & Elschlager 73]

History of Parts and Structure approaches • Fischler & Elschlager 1973 • Yuille ‘91 • Brunelli & Poggio ‘93 • Lades, v.d. Malsburg et al. ‘93 • Cootes, Lanitis, Taylor et al. ‘95 • Amit & Geman ‘95, ‘99 • Perona et al. ‘95, ‘96, ’98, ’00, ’03, ‘04, ‘05 • Felzenszwalb & Huttenlocher ’00, ’04 • Crandall & Huttenlocher ’05, ’06 • Leibe & Schiele ’03, ’04 • Many papers since 2000

Sparse representation + Computationally tractable (10^5 pixels → 10^1 – 10^2 parts) + Generative representation of class + Avoid modeling global variability + Success in specific object recognition - Throw away most image information - Parts need to be distinctive to separate from other classes

Region operators • Local maxima of interest operator function • Can give scale/orientation invariance Figures from [Kadir, Zisserman and Brady 04]

The correspondence problem • Model with P parts • Image with N possible assignments for each part • Consider mapping to be 1-1 • N^P combinations!!!
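To give a feel for the size of this search space, the sketch below counts the assignments for a model with P parts and N candidate features. The function names and the example values of N and P are illustrative assumptions, not taken from the slides. N^P (N to the power P) counts every way of giving each part one of the N features; enforcing a strict 1-1 mapping reduces this to N!/(N-P)!, which is still of order N^P when N is much larger than P.

```python
from math import perm


def num_assignments(n_features: int, n_parts: int) -> int:
    """Assignments when each part may pick any of the N features: N^P."""
    return n_features ** n_parts


def num_one_to_one(n_features: int, n_parts: int) -> int:
    """Assignments when no feature may be reused: N! / (N - P)!."""
    return perm(n_features, n_parts)


# Hypothetical sizes, chosen only for illustration.
for n, p in [(10, 3), (100, 6)]:
    print(f"N={n}, P={p}: N^P={num_assignments(n, p)}, 1-1={num_one_to_one(n, p)}")
```

Even a modest model (P=6) over 100 candidate features gives about 10^12 combinations, which is why the structure of the shape model matters for search.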

The correspondence problem • 1 – 1 mapping • Each part assigned to unique feature As opposed to: • 1 – Many • Bag of words approaches • Sudderth, Torralba, Freeman ’05 • Loeff, Sorokin, Arora and Forsyth ‘05 • Many – 1 • Quattoni, Collins and Darrell ’04

Location

Connectivity of parts • Complexity is given by size of maximal clique in graph • Consider a 3 part model • Each part has set of N possible locations in image • Locations of parts 2 & 3 are independent, given location of landmark part L • Each part has an appearance term, independent between parts. [Figure: shape model and its factor graph. Variables: L, 2, 3. Shape factors: S(L), S(L,2), S(L,3). Appearance factors: A(L), A(2), A(3).]
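The effect of this conditional independence can be sketched in a few lines. The example below is a minimal illustration, assuming log-domain scores stored in NumPy arrays: `A_L`, `A_2`, `A_3` are per-location appearance terms and `S_2[l, x]`, `S_3[l, x]` are pairwise shape terms between the landmark and each part (the landmark prior S(L) is assumed to be folded into `A_L`). All names are hypothetical. The brute-force search costs O(N^3) for three parts, while the factored version exploits the star structure and costs O(N^2) per non-landmark part.

```python
import itertools

import numpy as np


def best_score_brute_force(A_L, A_2, A_3, S_2, S_3):
    """Search all N^3 joint placements of (L, 2, 3)."""
    n = len(A_L)
    best = -np.inf
    for l, x2, x3 in itertools.product(range(n), repeat=3):
        score = A_L[l] + A_2[x2] + A_3[x3] + S_2[l, x2] + S_3[l, x3]
        best = max(best, score)
    return best


def best_score_star(A_L, A_2, A_3, S_2, S_3):
    """Use independence of parts 2 and 3 given L: maximise each separately."""
    # For every landmark location (rows), best achievable score of each part.
    m2 = (S_2 + A_2[None, :]).max(axis=1)
    m3 = (S_3 + A_3[None, :]).max(axis=1)
    return (A_L + m2 + m3).max()
```

Both functions return the same maximum; only the amount of work differs, which is the sense in which the maximal clique size (here two variables) sets the complexity.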

Different connectivity structures (figure from Sparse Flexible Models of Local Features, Gustavo Carneiro and David Lowe, ECCV 2006) • Felzenszwalb & Huttenlocher ‘00 • Fergus et al. ’03 • Fei-Fei et al. ‘03 • Crandall et al. ‘05 • Fergus et al. ’05 • Crandall et al. ‘05 • Complexities shown in figure: O(N^2), O(N^6), O(N^2), O(N^3) • Csurka ’04 • Vasconcelos ‘00 • Bouchard & Triggs ‘05 • Carneiro & Lowe ‘06

How much does shape help? • Crandall, Felzenszwalb, Huttenlocher CVPR’05 • Shape variance increases with increasing model complexity • Do get some benefit from shape

Hierarchical representations • Pixels → Pixel groupings → Parts → Object • Multi-scale approach increases number of low-level features • Amit and Geman ‘98 • Bouchard & Triggs ‘05 Images from [Amit98,Bouchard05]

Some class-specific graphs • Articulated motion • People • Animals • Special parameterisations • Limb angles Images from [Kumar, Torr and Zisserman 05, Felzenszwalb & Huttenlocher 05]

Dense layout of parts Part labels (color-coded) Layout CRF: Winn & Shotton, CVPR ‘06

How to model location? • Explicit: Probability density functions • Implicit: Voting scheme • Invariance • Translation • Scaling • Similarity/affine • Viewpoint [Figure panels: Translation; Translation and Scaling; Similarity transformation; Affine transformation]

Explicit shape model • Cartesian • E.g. Gaussian distribution • Parameters of model, µ and Σ • Independence corresponds to zeros in Σ • Burl et al. ’96, Weber et al. ‘00, Fergus et al. ’03 • Polar • Convenient for invariance to rotation • Mikolajczyk et al., CVPR ‘06
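As a concrete illustration of the Cartesian case, the sketch below evaluates the log density of a set of part locations under a joint Gaussian with mean µ and covariance Σ. It is a minimal example under stated assumptions: translation invariance is obtained by expressing every part relative to the first part, and the function name and array layout are hypothetical rather than taken from any of the cited papers.

```python
import numpy as np


def shape_log_prob(locations, mu, sigma):
    """Log density of part locations under a joint Gaussian shape model.

    locations: (P, 2) array of part positions.
    mu:        (2 * (P - 1),) mean of positions relative to part 0.
    sigma:     (2 * (P - 1), 2 * (P - 1)) covariance; zero off-diagonal
               blocks correspond to parts that are modelled as independent.
    """
    locations = np.asarray(locations, dtype=float)
    # Subtract part 0 so that a global translation leaves the score unchanged.
    rel = (locations[1:] - locations[0]).ravel()
    diff = rel - np.asarray(mu, dtype=float)
    k = diff.size
    _, logdet = np.linalg.slogdet(sigma)
    maha = diff @ np.linalg.solve(sigma, diff)
    return -0.5 * (k * np.log(2.0 * np.pi) + logdet + maha)
```

A full Σ couples all parts, while a block-diagonal Σ makes the non-landmark parts independent of one another given the reference part.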

Implicit shape model • Use Hough space voting to find object • Leibe and Schiele ’03,’05 • Learn appearance codebook • Cluster over interest points on training images • Learn spatial distributions • Match codebook to training images • Record matching positions on object • Centroid is given [Figure: learning and recognition stages; panels: Interest Points, Matched Codebook Entries, Probabilistic Voting; spatial occurrence distributions over x, y, s]
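The voting step can be sketched as follows. This is a simplified illustration, not the published method: it votes over position only (the scale dimension s is omitted), uses a plain accumulator array instead of a continuous voting space, and assumes each codebook entry stores a list of integer offsets from feature position to object centroid recorded during training. The names `hough_vote`, `best_centroid`, `matches` and `offsets` are hypothetical.

```python
import numpy as np


def hough_vote(matches, offsets, image_shape):
    """Accumulate centroid votes from matched codebook entries.

    matches:     list of (codebook_id, (row, col)) for interest points.
    offsets:     dict codebook_id -> list of (d_row, d_col) to the centroid.
    image_shape: (rows, cols) of the accumulator.
    """
    acc = np.zeros(image_shape, dtype=float)
    for entry, (r, c) in matches:
        votes = offsets.get(entry, [])
        if not votes:
            continue
        weight = 1.0 / len(votes)  # each match distributes one unit of vote
        for dr, dc in votes:
            rr, cc = r + dr, c + dc
            if 0 <= rr < image_shape[0] and 0 <= cc < image_shape[1]:
                acc[rr, cc] += weight
    return acc


def best_centroid(acc):
    """Location of the strongest peak in the accumulator."""
    return tuple(int(i) for i in np.unravel_index(np.argmax(acc), acc.shape))
```

Consistent matches reinforce each other at the true centroid, while spurious matches scatter their votes, so the shape model is never written down explicitly.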

Deformable Template Matching • Berg, Berg and Malik CVPR 2005 • Formulate problem as Integer Quadratic Programming • O(N^P) in general • Use approximations that allow P=50 and N=2550 in <2 secs [Figure panels: Query, Template]

Other invariance methods • Search over transformations • Large space (# pixels x # scales ….) • Closed form solution for translation and scale (Helmer and Lowe ’04) • Features give information • Characteristic scale • Characteristic orientation (noisy) [Figure: invariance of the characteristic scale, from Mikolajczyk & Schmid]

Multiple views • Mixture of 2-D models • Weber, Welling and Perona CVPR ‘00 [Figure: orientation tuning plot, % correct vs. angle in degrees; panels: Component 1, Component 2, Frontal, Profile]

Multiple view points • Thomas, Ferrari, Leibe, Tuytelaars, Schiele, and L. Van Gool. Towards Multi-View Object Class Detection, CVPR 06 • Hoiem, Rother, Winn, 3D LayoutCRF for Multi-View Object Class Recognition and Segmentation, CVPR ‘07

Appearance

Representation of appearance • Needs to handle intra-class variation • Task is no longer matching of descriptors • Implicit variation (VQ to get discrete appearance) • Explicit model of appearance (e.g. Gaussians in SIFT space) • Dependency structure • Often assume each part’s appearance is independent • Common to assume independence with location

Representation of appearance • Invariance needs to match that of shape model • Insensitive to small shifts in translation/scale • Compensate for jitter of features • e.g. SIFT • Illumination invariance • Normalize out

Appearance representation • SIFT • Decision trees [Lepetit and Fua CVPR 2005] • PCA Figure from Winn & Shotton, CVPR ‘06

Occlusion • Explicit • Additional match of each part to missing state • Implicit • Truncated minimum probability of appearance [Figure: log probability vs. appearance space, centred on µ_part]
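Both options can be sketched in a few lines. The example below is a minimal illustration that assumes a 1-D Gaussian appearance model per part; the function names, the `floor` parameter and the `missing_log_prob` value are hypothetical. The implicit scheme clips the appearance log probability at a floor so that an occluded part costs a bounded penalty, and the explicit scheme adds a "missing" state that competes with the candidate matches.

```python
import numpy as np


def gaussian_log_prob(a, mu, var):
    """Log density of a 1-D Gaussian appearance model."""
    return -0.5 * (np.log(2.0 * np.pi * var) + (a - mu) ** 2 / var)


def truncated_log_prob(a, mu_part, var, floor):
    """Implicit occlusion: never score a part below `floor`."""
    return np.maximum(gaussian_log_prob(a, mu_part, var), floor)


def match_or_missing(candidate_log_probs, missing_log_prob):
    """Explicit occlusion: pick the best candidate or the missing state.

    Returns (score, index), where index is None if the part is declared missing.
    """
    best_score, best_index = missing_log_prob, None
    for i, lp in enumerate(candidate_log_probs):
        if lp > best_score:
            best_score, best_index = lp, i
    return best_score, best_index
```

In both cases a badly matching or absent part lowers the overall score by a fixed amount instead of vetoing the whole hypothesis.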

Background clutter • Explicit model • Generative model for clutter as well as foreground object • Use a sub-window • At correct position, no clutter is present

Recognition

What task? • Classification • Object present/absent in image • Background may be correlated with object • Localization / Detection • Localize object within the frame • Bounding box or pixel-level segmentation

Efficient search methods • Interpretation tree (Grimson ’87) • Condition on assigned parts to give search regions for remaining ones • Branch & bound, A*

Distance transforms • Felzenszwalb and Huttenlocher ’00 & ’05 • Distance transforms • O(N^2 P) → O(NP) for tree structured models • How it works • Assume location model is Gaussian (i.e. e^(-d^2)) • Consider a two part model (landmark L and part 2) with µ=0, σ=1 on a 1-D image • Appearance log probability at image pixel x_i for part 2 = A_2(x_i) • Log probability of location f(d) = -d^2 [Figure: model with parts L and 2; log probability vs. image pixel]

Distance transforms 2 • For each position of landmark part, find best position for part 2 • Finding most probable x_i is equivalent to finding maximum over set of offset parabolas • Upper envelope computed in O(N) rather than obvious O(N^2) via distance transform (see Felzenszwalb and Huttenlocher ’05). • Add A_L(x) to upper envelope (offset by µ) to get overall probability map [Figure: log probability vs. image pixel, showing parabolas raised by A_2(x_g), A_2(x_h), A_2(x_i), A_2(x_j), A_2(x_k), A_2(x_l) at pixels x_g to x_l]
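The upper envelope computation can be sketched for the 1-D case described here. The code below is a minimal illustration, assuming finite appearance scores and the location term f(d) = -d^2 with µ=0; it follows the lower-envelope construction of the 1-D generalized distance transform, applied to the negated scores. The function names are hypothetical, and a brute-force O(N^2) version is included only as a reference for comparison.

```python
import numpy as np


def dt1d(f):
    """min over q of f[q] + (p - q)^2 for every p, in O(N)."""
    f = np.asarray(f, dtype=float)
    n = len(f)
    d = np.empty(n)
    v = np.zeros(n, dtype=int)  # positions of parabolas in the lower envelope
    z = np.empty(n + 1)         # boundaries between adjacent parabolas
    k = 0
    z[0], z[1] = -np.inf, np.inf
    for q in range(1, n):
        while True:
            p = v[k]
            # Intersection of the parabola rooted at q with the one rooted at p.
            s = ((f[q] + q * q) - (f[p] + p * p)) / (2.0 * q - 2.0 * p)
            if s <= z[k]:
                k -= 1          # parabola p is hidden; discard it
            else:
                break
        k += 1
        v[k] = q
        z[k], z[k + 1] = s, np.inf
    k = 0
    for q in range(n):
        while z[k + 1] < q:
            k += 1
        d[q] = (q - v[k]) ** 2 + f[v[k]]
    return d


def best_part2_score(A2):
    """max over x of A2[x] - (p - x)^2 for every landmark position p."""
    return -dt1d(-np.asarray(A2, dtype=float))


def best_part2_score_brute(A2):
    """Reference O(N^2) version of the same quantity."""
    A2 = np.asarray(A2, dtype=float)
    idx = np.arange(len(A2))
    return (A2[None, :] - (idx[:, None] - idx[None, :]) ** 2).max(axis=1)
```

Under these assumptions the overall log probability map for the two part model is `A_L + best_part2_score(A2)`, with `A_L` the landmark appearance scores at each pixel.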

Parts and Structure demo • Gaussian location model – star configuration • Translation invariant only • Use 1st part as landmark • Appearance model is template matching • Manual training • User identifies correspondence on training images • Recognition • Run template for each part over image • Get local maxima → set of possible locations for each part • Impose shape model - O(N^2 P) cost • Score of each match is combination of shape model and template responses.

Demo images • Sub-set of Caltech face dataset • Caltech background images

Demo Web Page

Demo (2)

Demo (3)

Demo (4)

Demo: efficient methods

Stochastic Grammar of Images • S.C. Zhu et al. and D. Mumford

Context and Hierarchy in a Probabilistic Image Model • Jin & Geman (2006) [Figure: hierarchy levels: e.g. animals, trees, rocks; e.g. contours, intermediate objects; e.g. linelets, curvelets, T-junctions; e.g. discontinuities, gradient. Examples: animal head instantiated by bear head; animal head instantiated by tiger head]

Parts and Structure models: Summary • Correspondence problem • Efficient methods for large # parts and # positions in image • Challenge to get representation with desired invariance Future directions: • Multiple views • Approaches to learning • Multiple category training

References 2. Parts and Structure

[Agarwal02] S. Agarwal and D. Roth. Learning a sparse representation for object detection. In Proceedings of the 7th European Conference on Computer Vision, Copenhagen, Denmark, pages 113-130, 2002.
[Agarwal_Dataset] S. Agarwal, A. Awan, and D. Roth. UIUC Car dataset. http://l2r.cs.uiuc.edu/~cogcomp/Data/Car, 2002.
[Amit98] Y. Amit and D. Geman. A computational model for visual selection. Neural Computation, 11(7):1691-1715, 1998.
[Amit97] Y. Amit, D. Geman, and K. Wilder. Joint induction of shape features and tree classifiers. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(11):1300-1305, 1997.
[Amores05] J. Amores, N. Sebe, and P. Radeva. Fast spatial pattern discovery integrating boosting with constellations of contextual descriptors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, San Diego, volume 2, pages 769-774, 2005.
[Bar-Hillel05] A. Bar-Hillel, T. Hertz, and D. Weinshall. Object class recognition by boosting a part based model. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, San Diego, volume 1, pages 702-709, 2005.
[Barnard03] K. Barnard, P. Duygulu, N. de Freitas, D. Forsyth, D. Blei, and M. Jordan. Matching words and pictures. JMLR, 3:1107-1135, February 2003.
[Berg05] A. Berg, T. Berg, and J. Malik. Shape matching and object recognition using low distortion correspondence. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, San Diego, CA, volume 1, pages 26-33, June 2005.
[Biederman87] I. Biederman. Recognition-by-components: A theory of human image understanding. Psychological Review, 94:115-147, 1987.
[Biederman95] I. Biederman. An Invitation to Cognitive Science, Vol. 2: Visual Cognition, volume 2, chapter Visual Object Recognition, pages 121-165. MIT Press, 1995.

[Blei03] D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993-1022, January 2003.
[Borenstein02] E. Borenstein and S. Ullman. Class-specific, top-down segmentation. In Proceedings of the 7th European Conference on Computer Vision, Copenhagen, Denmark, pages 109-124, 2002.
[Burl96] M. Burl and P. Perona. Recognition of planar object classes. In Proc. Computer Vision and Pattern Recognition, pages 223-230, 1996.
[Burl96a] M. Burl, M. Weber, and P. Perona. A probabilistic approach to object recognition using local photometry and global geometry. In Proc. European Conference on Computer Vision, pages 628-641, 1996.
[Burl98] M. Burl, M. Weber, and P. Perona. A probabilistic approach to object recognition using local photometry and global geometry. In Proceedings of the European Conference on Computer Vision, pages 628-641, 1998.
[Burl95] M.C. Burl, T.K. Leung, and P. Perona. Face localization via shape statistics. In Int. Workshop on Automatic Face and Gesture Recognition, 1995.
[Canny86] J. F. Canny. A computational approach to edge detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 8(6):679-698, 1986.
[Crandall05] D. Crandall, P. Felzenszwalb, and D. Huttenlocher. Spatial priors for part-based recognition using statistical models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, San Diego, volume 1, pages 10-17, 2005.
[Csurka04] G. Csurka, C. Bray, C. Dance, and L. Fan. Visual categorization with bags of keypoints. In Workshop on Statistical Learning in Computer Vision, ECCV, pages 1-22, 2004.
[Dalal05] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, San Diego, CA, pages 886-893, 2005.
[Dempster76] A. Dempster, N. Laird, and D. Rubin. Maximum likelihood from incomplete data via the EM algorithm. JRSS B, 39:1-38, 1976.
[Dorko04] G. Dorko and C. Schmid. Object class recognition using discriminative local features. IEEE Transactions on Pattern Analysis and Machine Intelligence, Review (Submitted), 2004.
