Class 3: Feature Coding and Pooling Liangliang Cao, Feb 7, 2013 EECS 6890 - Topics in Information Processing Spring 2013, Columbia University http://rogerioferis.com/VisualRecognitionAndSearch
Problem The Myth of Human Vision Can you understand the following? 婴儿 बचचा 아가 bebé People may have difficulty understanding different texts, but NOT images. Photo courtesy of luster
Problem Can a computer vision system recognize objects or scenes the way humans do?
Why Is This Problem Important? Mobile apps Traditional companies Search engines
Problem Examples of Object Recognition Dataset http://www.vision.caltech.edu
Problem http://groups.csail.mit.edu/vision/SUN/
Overview of Classification Model
Coding and pooling choices across the main approaches:
- Histogram of SIFT: coding by vector quantization, pooling by histogram aggregation
- Uncertainty-based: coding by soft quantization, pooling by histogram aggregation
- Sparse coding: coding by soft quantization, pooling by max pooling
- Fisher vector / Supervector: coding by GMM probability estimation, pooling by GMM adaptation
Coding: map local features into a compact representation.
Pooling: aggregate these compact representations together.
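A minimal sketch (not from the slides) of this coding-then-pooling view, assuming SIFT-like descriptors and a precomputed codebook; all names and sizes below are illustrative:

```python
import numpy as np

def hard_code(descriptors, codebook):
    """Coding: map each local descriptor to its nearest codeword (one-hot)."""
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    codes = np.zeros((descriptors.shape[0], codebook.shape[0]))
    codes[np.arange(descriptors.shape[0]), d2.argmin(axis=1)] = 1.0
    return codes

def average_pool(codes):
    """Pooling: aggregate per-descriptor codes into one image-level vector."""
    return codes.mean(axis=0)    # average pooling = a normalized histogram

def max_pool(codes):
    return codes.max(axis=0)     # max pooling, used later with sparse codes

# toy usage
rng = np.random.default_rng(0)
sift = rng.normal(size=(200, 128))       # 200 local descriptors of one image
codebook = rng.normal(size=(256, 128))   # 256 codewords
image_repr = average_pool(hard_code(sift, codebook))  # dim = #codewords
```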
Outline • Histogram of local features • Bag of words model • Soft quantization and sparse coding • Fisher vector and supervector
Histogram of Local Features
Bag of Words Models Recall of Last Class • Powerful local features - DoG - Hessian, Harris - Dense sampling A non-fixed number of local regions per image!
Bag of Words Models Recall of Last Class (2) • Histograms can provide a fixed-size representation of images • Spatial pyramid/gridding can enhance the histogram representation with spatial information
Bag of Words Models Histogram of Local Features [histogram over codewords] dim = #codewords
Bag of Words Models Histogram of Local Features (2) [one histogram per spatial grid cell] dim = #codewords × #grids
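A hedged sketch of how such a grid-based histogram could be assembled; the keypoint positions, grid layout, and helper name are assumptions for illustration:

```python
import numpy as np

def spatial_histogram(codeword_ids, xy, image_size, grid=(2, 2), n_codewords=256):
    """codeword_ids: (N,) quantized labels; xy: (N, 2) keypoint (x, y) positions."""
    h, w = image_size
    rows = np.minimum((xy[:, 1] * grid[0] // h).astype(int), grid[0] - 1)
    cols = np.minimum((xy[:, 0] * grid[1] // w).astype(int), grid[1] - 1)
    cell = rows * grid[1] + cols                       # grid cell index per feature
    hist = np.zeros((grid[0] * grid[1], n_codewords))
    np.add.at(hist, (cell, codeword_ids), 1.0)         # count codewords per cell
    return hist.reshape(-1)                            # dim = #codewords * #grids
```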
Bag of Words Models Local Feature Quantization Slide courtesy of Fei-Fei Li
Bag of Words Models Local Feature Quantization
Bag of Words Models Local Feature Quantization - Vector quantization - Dictionary learning
Histogram of Local Features Dictionary for Codewords
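One common way to build the codeword dictionary is k-means over training descriptors; below is a small illustrative sketch (not the course's reference implementation):

```python
import numpy as np

def kmeans_dictionary(descriptors, k=256, n_iter=20, seed=0):
    """Learn k codewords as cluster centers of the training descriptors."""
    rng = np.random.default_rng(seed)
    centers = descriptors[rng.choice(len(descriptors), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        d2 = ((descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        assign = d2.argmin(axis=1)                 # vector quantization step
        for j in range(k):
            members = descriptors[assign == j]
            if len(members):
                centers[j] = members.mean(axis=0)  # move codeword to cluster mean
    return centers
```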
Bag of Words Models Most slides in this section are courtesy of Fei-Fei Li
Object Bag of ‘words’
Bag of Words Models Underlying Assumptions - Text [Example: two documents reduced to their salient words. A passage on human vision becomes the bag {sensory, brain, visual, perception, retinal, cerebral cortex, eye, cell, optical nerve, image, Hubel, Wiesel}; a passage on China's trade surplus becomes {China, trade, surplus, commerce, exports, imports, US, yuan, bank, domestic, foreign, increase, value}.]
Bag of Words Models Underlying Assumptions - Image
[Pipeline diagram: feature detection & representation → image representation (K-means) → category models (and/or) classifiers → decision, with separate learning and recognition phases.]
Bag of Words Models Borrowing Techniques from Text Classification • pLSA • Naïve Bayesian Model • wn: each patch in an image - wn = [0,0,…,1,…,0,0]T • w: a collection of all N patches in an image - w = [w1,w2,…,wN] • dj: the jth image in an image collection • c: category of the image • z: theme or topic of the patch
Bag of Words Models Probabilistic Latent Semantic Analysis (pLSA): graphical model d → z → w, with N words per document and D documents (Hofmann, 2001). Latent Dirichlet Allocation (LDA): c → z → w (Blei et al., 2001).
Bag of Words Models Probabilistic Latent Semantic Analysis (pLSA) [plate diagram d → z → w; discovered topic example: “face”] Sivic et al., ICCV 2005
Bag of Words Models
p(w_i | d_j) = \sum_{k=1}^{K} p(w_i | z_k) \, p(z_k | d_j)
Observed codeword distributions per image = codeword distributions per theme (topic), weighted by theme distributions per image.
Parameters estimated by EM or Gibbs sampling.
Slide credit: Josef Sivic
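A toy EM sketch for the pLSA updates above, run on a document-by-codeword count matrix; the dense-array implementation and variable names are mine:

```python
import numpy as np

def plsa_em(n, K=10, n_iter=50, seed=0):
    """n: (D, W) count matrix of codewords per document (image)."""
    D, W = n.shape
    rng = np.random.default_rng(seed)
    p_w_z = rng.random((K, W)); p_w_z /= p_w_z.sum(1, keepdims=True)   # p(w|z)
    p_z_d = rng.random((D, K)); p_z_d /= p_z_d.sum(1, keepdims=True)   # p(z|d)
    for _ in range(n_iter):
        # E-step: responsibilities p(z | d, w), shape (D, W, K)
        joint = p_z_d[:, None, :] * p_w_z.T[None, :, :]
        post = joint / (joint.sum(-1, keepdims=True) + 1e-12)
        # M-step: re-estimate both distributions from expected counts
        nk = n[:, :, None] * post
        p_w_z = nk.sum(0).T; p_w_z /= p_w_z.sum(1, keepdims=True) + 1e-12
        p_z_d = nk.sum(1);   p_z_d /= p_z_d.sum(1, keepdims=True) + 1e-12
    return p_w_z, p_z_d
```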
Bag of Words Models Recognition using pLSA: z* = \arg\max_z p(z | d) Slide credit: Josef Sivic
Bag of Words Models Scene Recognition using LDA [Latent Dirichlet Allocation plate diagram; topic example: “beach”] Fei-Fei et al., ICCV 2005
Bag of Words Models Spatially Coherent Latent Topic Model Cao and Fei-Fei, ICCV 2007
Bag of Words Models Simultaneous Segmentation and Recognition
Bag of Words Models Pros and Cons
Images differ from texts!
Bag of Words models are good at:
- Modeling prior knowledge
- Providing an intuitive interpretation
But these models suffer from:
- Loss of information when quantizing “visual words” → motivates better coding
- Loss of spatial information → motivates better pooling
Soft Quantization and Sparse Coding
Soft Quantization Hard Quantization
Soft Quantization Uncertainty-Based Quantization Model the uncertainty across multiple codewords van Gemert et al., Visual Word Ambiguity, PAMI 2009
Soft Quantization Intuition of UNC Hard quantization vs. soft quantization based on uncertainty van Gemert et al., Visual Word Ambiguity, PAMI 2009
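A small sketch of the kernel-codebook idea behind uncertainty-based quantization: each descriptor gives every codeword a normalized Gaussian kernel weight instead of a single hard vote. The kernel width sigma is an assumed free parameter, and this only approximates van Gemert et al.'s formulation:

```python
import numpy as np

def soft_assign_histogram(descriptors, codebook, sigma=100.0):
    """Soft (uncertainty-based) histogram over codewords for one image."""
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    w = np.exp(-d2 / (2.0 * sigma ** 2))           # kernel weight per codeword
    w /= w.sum(axis=1, keepdims=True) + 1e-12      # normalize per descriptor
    return w.sum(axis=0)                           # soft histogram, dim = #codewords
```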
Soft Quantization Improvement of UNC
Soft Quantization Sparse Coding-Based Quantization
Hard quantization can be viewed as an “extremely sparse representation”:
\min_c \|x - Bc\|^2 s.t. \|c\|_0 = 1, c \in \{0,1\}^K
A more general but hard-to-solve representation:
\min_c \|x - Bc\|^2 s.t. \|c\|_0 \le T
In practice we consider the \ell_1 relaxation (sparse coding):
\min_c \|x - Bc\|^2 + \lambda \|c\|_1
Soft Quantization Yang et al. obtain good recognition accuracy by combining sparse coding with a spatial pyramid and dictionary training. More details will be given in the group presentation. Yang et al., Linear Spatial Pyramid Matching using Sparse Coding for Image Classification, CVPR 2009
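A minimal sparse-coding sketch that solves the l1-relaxed objective above with plain ISTA; Yang et al. use a more efficient solver plus dictionary training, so treat this only as an illustration:

```python
import numpy as np

def sparse_code(x, B, lam=0.1, n_iter=200):
    """Solve min_c 0.5*||x - B c||^2 + lam*||c||_1 with plain ISTA.
    B: (d, K) dictionary of codewords; x: (d,) local descriptor."""
    L = np.linalg.norm(B, 2) ** 2          # Lipschitz constant of the smooth part
    c = np.zeros(B.shape[1])
    for _ in range(n_iter):
        u = c - B.T @ (B @ c - x) / L      # gradient step on the quadratic term
        c = np.sign(u) * np.maximum(np.abs(u) - lam / L, 0.0)  # soft-thresholding
    return c                               # sparse code; per-image max pooling follows
```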
Fisher Vector and Supervector One of the most powerful image/video classification techniques Thanks to Zhen Li and Qiang Chen for their constructive suggestions on this section
Fisher Vector and Supervector Winning Systems [Winning entries of the classification task in 2009, 2010, 2011, and 2012, and of the Large Scale Visual Recognition Challenge in 2010 and 2011]
Fisher Vector and Supervector Literature
Supervector:
[1] Yan et al., Regression from patch-kernel, CVPR 2008
[2] Zhou et al., SIFT-Bag kernel for video event analysis, ACM Multimedia 2008
[3] Zhou et al., Hierarchical Gaussianization for image classification, ICCV 2009
[4] Zhou et al., Image classification using super-vector coding of local image descriptors, ECCV 2010
Fisher Vector:
[5] Perronnin et al., Improving the Fisher kernel for large-scale image classification, ECCV 2010
[6] Jégou et al., Aggregating local image descriptors into compact codes, PAMI 2011
These papers are not easy to read, so let me take a simplified perspective via the coding & pooling framework.
Fisher Vector and Supervector Coding without Information Loss • Coding with hard assignment • Coding with soft assignment • How to keep all the information?
Fisher Vector and Supervector An Intuitive Illustration Coding
Fisher Vector and Supervector An Intuitive Illustration Coding [local descriptors softly assigned to GMM components 1, 2, and 3]
Fisher Vector and Supervector An Intuitive Illustration Pooling [the descriptors assigned to each component are aggregated (summed) per component]
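A hedged sketch of the Fisher-vector form of this coding/pooling picture, keeping only the first-order statistics and omitting the power and L2 normalizations of Perronnin et al. [5]; the diagonal-covariance GMM and all variable names are assumptions:

```python
import numpy as np

def fisher_first_order(X, weights, means, variances):
    """X: (N, d) descriptors; diagonal GMM with weights (K,), means/variances (K, d)."""
    N, K = X.shape[0], means.shape[0]
    # coding: posterior (soft assignment) of each descriptor to each component
    log_p = np.stack([
        np.log(weights[k])
        - 0.5 * (((X - means[k]) ** 2) / variances[k]
                 + np.log(2 * np.pi * variances[k])).sum(axis=1)
        for k in range(K)], axis=1)                    # (N, K)
    log_p -= log_p.max(axis=1, keepdims=True)
    gamma = np.exp(log_p)
    gamma /= gamma.sum(axis=1, keepdims=True)
    # pooling: aggregate normalized first-order differences per component
    fv = []
    for k in range(K):
        diff = (X - means[k]) / np.sqrt(variances[k])  # (N, d)
        fv.append((gamma[:, k:k + 1] * diff).sum(axis=0) / (N * np.sqrt(weights[k])))
    return np.concatenate(fv)                          # dim = K * d
```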
Fisher Vector and Supervector Implementation of Supervector In speech (speaker identification), the supervector refers to the stacked means of an adapted GMM: Supervector = [μ1T, μ2T, …, μKT]T
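A simplified sketch of building such a supervector: MAP-adapt the means of a universal GMM to one image's (or utterance's) descriptors and stack them. The relevance factor r and the diagonal-covariance GMM are assumptions of this sketch:

```python
import numpy as np

def supervector(X, weights, means, variances, r=16.0):
    """MAP-adapt the GMM means to descriptors X (N, d) and stack them."""
    K = means.shape[0]
    # soft assignment to the universal (background) GMM
    log_p = np.stack([
        np.log(weights[k])
        - 0.5 * (((X - means[k]) ** 2) / variances[k]
                 + np.log(2 * np.pi * variances[k])).sum(axis=1)
        for k in range(K)], axis=1)
    log_p -= log_p.max(axis=1, keepdims=True)
    gamma = np.exp(log_p)
    gamma /= gamma.sum(axis=1, keepdims=True)
    n_k = gamma.sum(axis=0)                          # soft counts per component
    ex_k = gamma.T @ X / (n_k[:, None] + 1e-12)      # per-component sample means
    alpha = n_k / (n_k + r)                          # data vs. prior trade-off
    adapted = alpha[:, None] * ex_k + (1.0 - alpha[:, None]) * means
    return adapted.reshape(-1)                       # stacked adapted means
```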
Fisher Vector and Supervector Interpretation of the Supervector Original distribution vs. new (adapted) distribution Picture from Reynolds, Quatieri, and Dunn, DSP, 2001
Fisher Vector and Supervector Normalization of Supervector In practice, a normalization step using the covariance matrix often improves performance. Moreover, we can subtract the original mean vector for the ease of normalization. The resulting representation is also called Hierarchical Gaussianization (HG).
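A small sketch of this normalization under the same assumptions as above: subtract the original means, whiten by the (diagonal) covariance, and weight each component:

```python
import numpy as np

def normalized_supervector(adapted_means, weights, means, variances):
    """adapted_means, means, variances: (K, d); weights: (K,)."""
    diff = (adapted_means - means) / np.sqrt(variances)  # subtract original means, whiten
    diff = diff * np.sqrt(weights)[:, None]              # weight each component
    return diff.reshape(-1)                              # HG-style image representation
```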