Dr. Alessandro Perina Epitomic representations in Computer Vision
Summary
• Epitome — Nebojsa Jojic, Brendan J. Frey, Anitha Kannan: Epitomic analysis of appearance and shape. ICCV 2003
• Video (3-dimensional) Epitome — Vincent Cheung, Brendan J. Frey, Nebojsa Jojic: Video epitomes. CVPR 2005
• HIV (1-dimensional) Epitome — Nebojsa Jojic, Vladimir Jojic, Brendan J. Frey, Christopher Meek, David Heckerman: Using epitomes to model genetic diversity: rational design of HIV vaccines. NIPS 2005
• Epitomic location recognition — Kai Ni, Anitha Kannan, Antonio Criminisi, John M. Winn: Epitomic location recognition. CVPR 2008
• Features Epitome — Alessandro Perina, Nebojsa Jojic: work in progress!
• Epitomic priors — Jonathan Warrell, Simon J. D. Prince, Alastair P. Moore: Epitomized priors for multi-labeling problems. CVPR 2009
• (Layered Epitome) — Alessandro Perina, Nebojsa Jojic: unpublished (no time..)
• Probabilistic index maps and Stel models
• Stel (structural) Epitome — Alessandro Perina, Vittorio Murino, Nebojsa Jojic: Structural epitome: a way to summarize one's visual experience. NIPS 2010
• Counting Grids — Alessandro Perina, Nebojsa Jojic: Image analysis by counting on a grid. CVPR 2011
Epitome
• … to learn a library of patches (clustering)
• … to edit images
• … to organize the visual memory (PanClustering)
• An intermediate representation between the template and the histogram
Generative model of patches
• The epitome is originally defined on a domain smaller than the image
• It is defined by its mean μ and variance φ (one of each per epitome pixel)
• A patch Z_k is generated from the epitome by choosing a mapping T_k according to p(T) and generating each pixel from the mean and variance at the mapped epitome location (in symbols below)
(graphical model: mapping T_k and epitome e generate patch Z_k)
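In symbols, the model described above can be written as follows (a reconstruction from the slide's description; e = (μ, φ) denotes the epitome mean and variance maps, and T_k(i) the epitome coordinate onto which pixel i of patch Z_k is mapped):

```latex
p(Z_k \mid T_k, e) = \prod_{i \in Z_k} \mathcal{N}\!\left(z_{k,i};\; \mu_{T_k(i)},\; \phi_{T_k(i)}\right),
\qquad
p(Z_k \mid e) = \sum_{T_k} p(T_k)\, p(Z_k \mid T_k, e)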
Learning
• Minimize the free energy (a bound on the data log-likelihood) with the EM algorithm
• E step: for each patch, infer the posterior over the mappings
• M step: average all patches using the posterior as a weight; estimate the variance analogously
(a sketch of the loop follows)
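A minimal NumPy sketch of this EM loop, under simplifying assumptions (grayscale patches, fixed variance, non-toroidal epitome); names such as `epitome_em` and `patches` are illustrative, and this is a reconstruction of the update described above, not the authors' code:

```python
import numpy as np

def epitome_em(patches, eh=32, ew=32, iters=16, var=0.1):
    """patches: (N, ph, pw) array; returns the epitome mean, shape (eh, ew)."""
    n, ph, pw = patches.shape
    e = np.random.rand(eh, ew)                       # epitome mean, random init
    positions = [(r, c) for r in range(eh - ph + 1)  # all candidate mappings T
                        for c in range(ew - pw + 1)]
    for _ in range(iters):
        num = np.zeros((eh, ew))                     # posterior-weighted sums
        den = np.zeros((eh, ew))                     # total posterior weight
        for z in patches:
            # E step: q(T) proportional to exp(Gaussian log-likelihood under T)
            ll = np.array([-np.sum((z - e[r:r+ph, c:c+pw])**2) / (2 * var)
                           for (r, c) in positions])
            q = np.exp(ll - ll.max())                # stable normalization
            q /= q.sum()
            # M step accumulation: average patches, weighted by the posterior
            for qt, (r, c) in zip(q, positions):
                num[r:r+ph, c:c+pw] += qt * z
                den[r:r+ph, c:c+pw] += qt
        e = num / np.maximum(den, 1e-12)             # update the epitome mean
    return e

# usage: e = epitome_em(np.random.rand(200, 8, 8))   # 200 random 8x8 patches
```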
Learning
(figure: the epitome evolving from iteration 1 to iteration 16)
Epitome as a clustering method
• A "smart" mixture of Gaussians with parameter sharing among the components
• Mixture of Gaussians: pick a center
• Epitome: pick a position (overlapping windows share pixels; see the sketch below)
(figure: mixture-of-Gaussians vs. epitome reconstruction)
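A tiny illustration of the "pick a center" vs. "pick a position" distinction (my own, with hypothetical helper names): restricting the epitome's allowed mappings to a non-overlapping grid of windows recovers an ordinary mixture of Gaussians, since no two components then share parameters.

```python
def mog_positions(eh, ew, ph, pw):
    # "pick a center": non-overlapping windows, i.e. independent components
    return [(r, c) for r in range(0, eh - ph + 1, ph)
                   for c in range(0, ew - pw + 1, pw)]

def epitome_positions(eh, ew, ph, pw):
    # "pick a position": every window, so neighbors share most of their pixels
    return [(r, c) for r in range(eh - ph + 1)
                   for c in range(ew - pw + 1)]

# on a 32x32 epitome with 8x8 patches: 16 components vs. 625 shared ones
```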
Applications and results
• Multiple causes
• Pattern recognition
• Background removal
1D and 3D Epitomes
• 3D (video) epitome. Applications: video inpainting, super-resolution, denoising, dropped-frame recovery
• 1D epitome: the epitome as a sequence of multinomial distributions, applied in biology. The sequences share patterns (although sometimes with discrepancies in isolated amino acids), but one sequence may be similar to other sequences in different regions (see the sketch below)
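A minimal sketch of the 1D case (my reconstruction, with illustrative names, not the NIPS 2005 code), assuming each epitome position holds a multinomial over the 20 amino acids and a subsequence is scored under every possible alignment (mapping):

```python
import numpy as np

AA = "ACDEFGHIKLMNPQRSTVWY"               # 20-letter amino-acid alphabet
IDX = {a: i for i, a in enumerate(AA)}

def score_alignments(seq, e):
    """seq: amino-acid string; e: (E, 20) array, rows are multinomials.
    Returns log p(seq | T=t) for every start position t in the epitome."""
    s = np.array([IDX[a] for a in seq])
    E, L = e.shape[0], len(s)
    return np.array([np.sum(np.log(e[t:t+L, :][np.arange(L), s]))
                     for t in range(E - L + 1)])

# usage: e = np.full((100, 20), 1/20); score_alignments("MKTAYIAK", e)
```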
Epitomic Location Recognition
• Problem: location recognition, i.e. "where am I?"
• Must fuse appearance and geometry, and adapt to camera rotation and focal-length changes
• Must do better than a bag-of-words representation with k-NN: shift may disrupt recognition, and it needs enough samples
• Solution: the epitomic representation
(figure: training images, a test image, and the epitomic representation as a panorama)
A smart mixture of Gaussians (again)
(figure: means and variances of a mixture of Gaussians vs. of an epitome panorama)
Observation model
• Multiple observation channels are introduced
• Continuous: depth map, R-G-B channels
• Discrete (categorical): edges, class label
• Local histograms are used as the appearance
• Supervised learning of the epitome: the training labels l are incorporated as an observation (at training time)
• The posterior over the labels is used for recognition
Results
• Tested on the MIT 7-classes/62-locations dataset previously used by Torralba's MoG approach
• Location recognition (recognize 2 different offices)
• Scene recognition (recognize offices vs. corridors)
Feature epitome
• Idea: use the epitome to cluster features (again, the epitome as a smart mixture of Gaussians)
• Discrete: already-discretized SIFT features
• Continuous: image patches, feature descriptors (e.g., SIFT)
Feature epitome: ideas
• Continuous: a smart mixture of Gaussians. We are developing a generative model to cluster the locations on the epitome, invariant to rotation, scaling and reflection (different centers, same edge!)
• Discrete: capture patterns of discrete features, i.e. a bag of words from the epitome. Learn the epitome from 2x2 patches; for each image, map its P patches onto the epitome (do inference) and use the result as the image signature (see the sketch below)
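A sketch of the "bag of words from the epitome" signature under the same simplifying assumptions as before (my reconstruction, illustrative names): each patch is softly assigned to epitome positions, and the accumulated posterior mass over positions is the signature.

```python
import numpy as np

def epitome_signature(patches, e, var=0.1):
    """patches: (P, ph, pw); e: (eh, ew) epitome mean.
    Returns a vector of posterior mass per epitome position."""
    ph, pw = patches.shape[1:]
    eh, ew = e.shape
    positions = [(r, c) for r in range(eh - ph + 1)
                        for c in range(ew - pw + 1)]
    sig = np.zeros(len(positions))
    for z in patches:
        ll = np.array([-np.sum((z - e[r:r+ph, c:c+pw])**2) / (2 * var)
                       for (r, c) in positions])
        q = np.exp(ll - ll.max())
        q /= q.sum()                      # posterior over mappings
        sig += q                          # accumulate into the signature
    return sig / len(patches)
```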
Patches generated by an epitome
• Patches generated from a previously learnt epitome can be used for other (surprising) tasks:
• Segmentation: the segmentation is refined using patches generated from the epitome (e.g., the combination red-green-red is not allowed!)
• Pedestrian re-identification
Layered epitome
• We want to automatically segment images in order to find recurrent segments
• Layers (segments), not patches, are mapped into the epitome (i.e., patch size is no longer an issue!)
• Very different from the Jigsaw model
• Details later…
(figure: layers 1 to 4)
Structural Epitome (idea)
• Capture the structure of the images
(figure: patches, image, epitome)
Stel models
• Nebojsa Jojic, Yaron Caspi: Capturing image structure with probabilistic index maps. CVPR 2004
• Alessandro Perina, Nebojsa Jojic et al.: Capturing video structure with mixtures of probabilistic index maps. ECCV workshops 2008
• Nebojsa Jojic, Alessandro Perina et al.: Stel component analysis. CVPR 2009
• Alessandro Perina, Nebojsa Jojic et al.: Object recognition with hierarchical stel models. ECCV 2010
Probabilistic index map (PIM)
• An index map captures the common structure of a set of images
• Each image is described with an index map and a palette
• In general, object classes present images with different colors but similar structure
• Each pixel is characterized by an index; the palette decides the color of the pixel
(figure: images and their shared structure)
Index maps (again!)
(figure: a dataset, the per-image palettes, and the shared index map)
Probabilistic index map (2)
• The PIM injects uncertainty into the index map
• It acts as a prior on the "individual" index maps
• The palette is modeled as a Gaussian
• Each image has its own structure (its own index map); consistency between the index maps is enforced by means of the prior (see the sketch below)
(graphical model: prior p(S) over index maps S such as "body" and "legs", per-image palette means and variances, observed pixels X)
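A minimal sketch of the per-pixel PIM posterior (my reconstruction, illustrative names): for each pixel, q(s) is proportional to the shared prior p(s) times the image-specific Gaussian palette likelihood. The M-step would then re-estimate each palette as q-weighted means and variances.

```python
import numpy as np

def pim_posterior(x, prior, mu, var):
    """x: (H, W) image; prior: (S, H, W) per-pixel prior over stels;
    mu, var: (S,) Gaussian palette of this image. Returns q: (S, H, W)."""
    S = prior.shape[0]
    loglik = np.stack([-0.5 * np.log(2 * np.pi * var[s])
                       - (x - mu[s])**2 / (2 * var[s]) for s in range(S)])
    logq = np.log(prior + 1e-12) + loglik          # prior times likelihood
    logq -= logq.max(axis=0, keepdims=True)        # stabilize the exp
    q = np.exp(logq)
    return q / q.sum(axis=0, keepdims=True)        # normalize over stels
```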
Graphical models
(figure: graphical models of the probabilistic index map, the epitome, and the stel epitome)
Stel epitome (example)
(figure: stel epitome for s=1 and s=2, PIM posterior q(s), patches, image, mapping prior p(T), and patch averaging)
Stel (structural) epitome
• The stel epitome captures structural patterns recurrent in one image (or in a collection of images), as the original epitome model does
• It generates (color-invariant) stel panoramas, as in epitomic location recognition
Stel Epitome vs. Regular Epitome
• The epitome is appearance-based; it does not work if dramatic changes in illumination are present (different cameras, differences in environmental illumination)
• The stel epitome is based on the PIM, and is therefore color invariant; it is an epitome-like model, and is therefore transformation invariant
(figure: 4 frames, panels A, B, C)
Modeling a single scene
(figure: stel epitome reconstruction vs. regular epitome)
Results • We extracted a subset of senseCam images: 10 classes, 600 images • Unsupervised learning of the epitome (no class information is used)
Comparison with other clustering methods
• After learning the epitome, we mapped the test images onto it and used the labels to estimate the accuracy
• For epitomes we employed 1-NN using the distance on a torus (see the sketch below)
• For mixtures of PIMs we assigned each cluster to a class (majority vote)
(figure: epitome mean, epitome variance, stel-epitome reconstruction, image mappings)
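A small sketch (my own, illustrative) of the wrap-around distance used for 1-NN on mapping coordinates: the epitome is toroidal, so the distance between two positions wraps around both axes.

```python
def torus_distance(p, q, eh, ew):
    """p, q: (row, col) mappings on an eh x ew toroidal epitome."""
    dr = abs(p[0] - q[0]); dr = min(dr, eh - dr)   # wrap vertically
    dc = abs(p[1] - q[1]); dc = min(dc, ew - dc)   # wrap horizontally
    return (dr**2 + dc**2) ** 0.5

# usage: torus_distance((2, 3), (30, 1), eh=32, ew=32)
# treats rows 2 and 30 as 4 apart, not 28
```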
Place classification • We made use of the labels at training time in 2 ways: • Learning 1 epitome per class • Learning a class-dependent prior on the mappings • We compared epitome models with: • Generative semi-supervised methods • k-NN using bag of words histograms
SenseCam dataset V2.0 • The full dataset (~24 days of life) is available on my webpage: • http://profs.sci.univr.it/~perina/sensecam.htm • 43515 images, 1 photo/20 seconds, 640 x 480 pixels. • Moreover, we labeled 10% of the documents, extracting 32 recurrent scenes like • Work environments: Office, Hall, Lounge, Parking, Cafeterias… • Home: Hall, Kitchen, Dining Room, Office, Living Room… • Outdoor: Biking trail, Hiking trail, MS Campus… • Other: Supermarket, Bakery … • At the moment we are labeling familiar faces, distinguishing between family members (Ana, Ivana and Marko) and other people.
Avoiding numerical underflow
• In all the epitome-like generative models one must infer the marginal posterior q(T)
• In practice it happens that some areas of the epitome seem never to be visited by any patch/image
(figure: an apparently unvisited epitome region)
Avoiding numerical underflow (2)
• Keep multiple precisions for the posterior
• For each posterior distribution q(T), remember the maximum precision that can be used while avoiding zeros (a standard alternative is sketched below)
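For reference, the common log-sum-exp normalization solves the same underflow problem; this is the standard trick, not necessarily the multiple-precision scheme the slide describes:

```python
import numpy as np

def normalize_posterior(loglik):
    """loglik: unnormalized log q(T) over all mappings; returns q(T)."""
    m = loglik.max()                     # subtract the max so exp() never
    q = np.exp(loglik - m)               # underflows for the best mappings
    return q / q.sum()

# e.g. loglik = np.array([-9000.0, -9005.0]): exp() alone underflows to 0,
# but normalize_posterior gives [0.9933, 0.0067]
```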
Layered epitome (2)
• Introduced to model parallax. Parallax is the apparent displacement of an object viewed from two different points of view
• In the 2 images on the right, foreground and background have different "movements": the foreground (layer 1) is shifted left, while the background (layer 2) seems to stay still
• Application: reconstruction of an image given N prototype images (i.e., viewpoint recovery?)
• I can arrange the layers of different images to model a new one (the target image)
• This is equivalent to an epitome model in which the epitome is the target image, and in the E-step I look for the layers of the prototype images that best fit it
(figure: image 1, image 2, layers 1 and 2, the new point of view, the epitome)
Counting Grids
• One of the major problems of epitome-like representations is that they require (super-)pixel-wise comparisons
• Counting Grids idea: place an image in a spot on the grid if its bag-of-words representation agrees with the bag-of-words representations of the images mapped in its neighborhood
(figure: images with the "same" BoW representation mapped nearby)
Learning
• E-step: compute the mappings using the BoW representation
• M-step: the usual epitome update (i.e., keep the spatial information)
E-step
• Associated with each position i in the grid there is a distribution π_i over the features
• The average of π in a window W_j, h_j(z) = (1/|W_j|) Σ_{i∈W_j} π_i(z), represents the counts that we may observe
• So we can place an image by comparing its BoW counts c with h_j, e.g. q(j) ∝ exp( Σ_z c_z log h_j(z) ) (see the sketch below)
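A sketch of this E-step (my reconstruction, illustrative names): each grid position holds a distribution over Z features, the window average h_j is formed on the torus, and the image's bag-of-words counts are scored against it.

```python
import numpy as np

def cg_estep(c, pi, win):
    """c: (Z,) feature counts of one image; pi: (H, W, Z) grid distributions;
    win: window side. Returns q: (H, W) posterior over placements (toroidal)."""
    H, W, Z = pi.shape
    h = np.zeros_like(pi)
    for dr in range(win):                 # window average on the torus
        for dc in range(win):
            h += np.roll(pi, (-dr, -dc), axis=(0, 1))
    h /= win * win
    loglik = (np.log(h + 1e-12) * c).sum(axis=2)   # sum_z c_z log h_j(z)
    q = np.exp(loglik - loglik.max())              # stable normalization
    return q / q.sum()
```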
FFT version (M-step)
• The M-step is performed by the usual epitome update
• The window sums that the update requires can be carried out efficiently using the integral-image trick (a cumulative sum along each axis) or FFT-based convolution (see the sketch below)
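A sketch of the integral-image (cumulative-sum) trick (my own, illustrative): all window sums of a map are obtained from one pair of cumulative sums, instead of summing each window independently.

```python
import numpy as np

def window_sums(a, win):
    """a: (H, W) map; returns (H-win+1, W-win+1) sums over win x win windows."""
    ii = np.pad(a, ((1, 0), (1, 0))).cumsum(0).cumsum(1)  # integral image
    return (ii[win:, win:] - ii[:-win, win:]
            - ii[win:, :-win] + ii[:-win, :-win])

# usage: window_sums(np.ones((5, 5)), 3) -> every entry equals 9.0
```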
Counting Grid generative model
• The algorithm presented is not a real EM algorithm, because we have not defined a generative model
• In the real generative model's M-step, each feature is distributed over the grid positions it may have come from, proportionally to the current grid distributions
• The algorithm just shown is an approximation of the real generative model (convergence is still guaranteed)
(figure: example grid of counts)
CGs: another view
• I want to predict the feature distribution of D using A, B and C
• Compare the feature counts in the test image with each of the previously seen exemplars [kernel / nearest-neighbor techniques]. Overtraining: none of A, B and C has the feature combination of D
• Consider all the bags together and generalize by merging them [LDA, mixtures of multinomials]. Overgeneralization: I can generate feature combinations not present in the images
CGs: another view, Solution
• Solution: spatial reasoning. Features are disordered, but constrained to a window of size W
• The spatial structure is still reflected in the disordered bags if we consider them together and invert the process of bag creation
• E.g., by ordering the features into a strip, each bag is represented by a window; this makes some bags likely and others unlikely
(figure: example bags)
CGs: another view, Solution (2)
• Not much of the spatial organization of the features in the snapshots is needed to infer a 2D mapping of the features in the original scene; here it is enough to keep only the histograms in 4 sectors
(figure: grid learned with 100 patches, none of which has the same feature combination as D)
Results
• Scene classification: we considered the 15 scene categories dataset; window size = image size, as we hope to recover the original structure
• Place clustering (epitome): we used the SenseCam dataset (600 images, 10 classes) and performed the same test we previously ran for stel epitomes
(figure: Counting Grids vs. Latent Dirichlet Allocation accuracy as a function of the ratio between counting-grid size and window size)
Counting Grids (other results)
• Any kind of bag-of-words representation can be embedded: text, biological sequences and data, preferences
• The window size is critical, as different window sizes come with different degrees of overlap
• The number of dimensions could be different from 2! What about embedding images in 3D space? In which dimension should text be embedded? Another parameter to study…
• Alessandro Perina, Nebojsa Jojic: Multidimensional counting grids: Inferring word order from disordered bags of words. UAI 2011