This work explores the application of semantic understanding of images in training computer vision algorithms for image retrieval, robotic navigation, semantic labeling, image sketching, and more. It covers topics such as feature extraction, matching and association, and ground imagery and video processing.
Training Image Classifiers with Similarity Metrics, Linear Programming, and Minimal Supervision
Asilomar SSC
Karl Ni, Ethan Phelps, Katherine Bouman, Nadya Bliss
Lincoln Laboratory, Massachusetts Institute of Technology
2 November 2012
This work is sponsored by the Department of the Air Force under Air Force contract FA8721-05-C-0002. Opinions, interpretations, conclusions, and recommendations are those of the author and are not necessarily endorsed by the United States Government.
Applying Semantic Understanding of Images
What can a computer understand?
• Who?
• What?
• When?
• Where?
Requires: some prior knowledge
• Computer vision algorithms
• Image retrieval
• Robotic navigation
• Semantic labeling
• Image sketch
• Structure from Motion
• Image localization
Diagram labels: Query by example; Statistically modeled; Query by sketch; Training Data; Feature Extraction; Matching & Association; Classifier; Decision!
Training Framework
OFFLINE SETUP
• Multi-Modal Sources: Ground Imagery, Video; Aerial Imagery, Video
• Processing: Feature Extraction, Matching & Association
FRAMEWORK
• World Model: Metadata, Graphs, Point Clouds, Distributions, Terrain, etc.
EXPLOITATION
• Localization Algorithms
• Location
Outline • Introduction • Feature Pruning Background • Matched Filter Training • Results • Summary
Finding the Features of an Image
Features are a quantitative way for machines to understand an image
Image Property: Feature Technique
• Local color: luma + chroma histograms
• Object texture: DCT (local & normalized)
• Shape: curvelets, shapelets
• Lower-level gradients: DWT (Haar, Daubechies)
• Higher-level descriptors: SIFT/SURF/HoG, etc.
• Scene descriptors: GIST (Torralba et al.)
Problems in image pattern matching
• Each image = 10 million pixels!
• Most dimensions are irrelevant
• Multiple concepts inside the image
• Typical chain: Feature Extraction → Training / Classifier
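As a concrete instance of the "local color" entry, here is a minimal sketch of a luma histogram feature. The function name `luma_histogram`, the bin count, and the BT.601 luma weights are illustrative assumptions; the slides do not say how the histograms were computed.

```python
def luma_histogram(pixels, bins=8):
    """Return a normalized luma histogram for a list of (R, G, B) pixels in 0..255."""
    if not pixels:
        raise ValueError("need at least one pixel")
    counts = [0] * bins
    for r, g, b in pixels:
        # ITU-R BT.601 luma weights (an assumption; any luma definition would do)
        y = 0.299 * r + 0.587 * g + 0.114 * b
        index = min(int(y * bins / 256), bins - 1)
        counts[index] += 1
    total = float(len(pixels))
    return [c / total for c in counts]


# A tiny hypothetical patch: two reddish pixels, one blue, one white.
patch = [(255, 0, 0), (250, 5, 5), (0, 0, 255), (255, 255, 255)]
hist = luma_histogram(patch)
```

The result is a short fixed-length vector regardless of patch size, which is what lets millions of pixels be compared through a handful of dimensions.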
Numerous features: subset is relevant
Feature Extraction → Training / Classifier
FEATURES ARE:
• Red bricks on multiple buildings
• Small hedges, etc.
• Windows of a certain type
• Types of buildings that are there
FEATURES ARE:
• More suburb-like
• Larger roads
• Drier vegetation
• Shorter houses
FEATURES ARE:
• Arches and white buildings
• Domes and ancient architecture
• Older/speckled materials (higher frequency image content)
Choice of features requires looking at multiple semantic concepts defined by entities and attributes inside of images
Feature Descriptors
Feature Extraction → Training / Classifier
Most of the features are irrelevant
• Large dimensionality and algorithmic complexity
• Keep small numbers of salient features and discard large numbers of nondescriptive features
• Feature invariance to transformations, content, and context holds only to an extent (e.g., SIFT, RIFT, etc.)
• Simplify classifier (both computation & supervision)
Multiple instances of several features describing the same object
• Require a high level of abstraction
• Visual similarity does not always correlate with “semantic” similarity
Brown et al., Lowe et al., Ng et al., Thrun et al.
Getting the Right Features
Feature Extraction
Tools to hand-label concepts (2006-2011)
• Google Image Labeler
• Kobus’s Corel Dataset
• MIT LabelMe
• Yahoo! Games
Problems
• Tedious
• Time consuming
• Incorrect
• Very low throughput
Famous algorithms
• Parallelizable
• Not generalizable, unfortunately
Figure: example image with labeled regions 1. Chair, 2. Table, 3. Road, 4. Road, 5. Table, 6. Car, 7. Keyboard. People can’t be flying or walking on billboards!
Automatically Learn the Best Features
• Segmentation is a difficult manual task
• Multiple semantic concepts per single image
• Considerable amounts of noise, most often irrelevant to any concept
Figure: Semantic Simplex with vertices Concept 1 (e.g., sky), Concept 2 (e.g., mountain), Concept 3 (e.g., river). Kwitt et al. (Kitware)
Leveraging Related Work
Lots of work in the 1990s
• Conditional probabilities through large training data sets
• Motivated by the query by example and query by sketch problems (not the IBM query for relational databases by Zloof, but Ballerini et al.)
• Primarily based on multiple instance learning and noisy density estimation
Learning multiple instances of an object (no noise case): Dietterich et al., Keeler et al.
Robustness to noise through the law of large numbers
• Hope to integrate it out
• Noise, if uncorrelated, will become more and more sparse
• Although the area of red boxes per instance is small, their aggregate over all instances is dominant
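The aggregation argument can be illustrated with a small simulation. This is a sketch under invented assumptions: features are quantized to integer codewords, every image contains two instances of one shared concept codeword, and the remaining eight features per image are uncorrelated noise drawn uniformly from 1000 other codewords. None of these numbers come from the slides.

```python
import random
from collections import Counter

rng = random.Random(0)       # fixed seed so the run is repeatable
CONCEPT = 0                  # hypothetical codeword shared by every image
NUM_NOISE_CODEWORDS = 1000   # hypothetical size of the noise vocabulary

counts = Counter()
for _ in range(200):  # 200 training images
    noise = [rng.randint(1, NUM_NOISE_CODEWORDS) for _ in range(8)]
    image = [CONCEPT, CONCEPT] + noise  # the concept is a small part of each image
    counts.update(image)

most_common_codeword, occurrences = counts.most_common(1)[0]
noise_peak = max(c for w, c in counts.items() if w != CONCEPT)
```

Although the concept makes up only a fifth of each image, its count grows linearly with the number of images, while the uncorrelated noise is spread thinly over the vocabulary.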
Parallel Calculations through Hierarchies
• Feature clustering in the large
• Mixture hierarchies can be incrementally trained (Vasconcelos et al.)
• Automatic feature subselection has been submitted to SSP 2012
Diagram: for each image class (Image Class 1, Image Class 2, ..., Image Class N), training images (image 1, image 2, image 3) are processed on the Lincoln Laboratory GRID. Each entire image feeds a lower-level GMM (Distribution 1, Distribution 2, ..., Distribution N), and the lower-level GMMs feed a top-level GMM. The per-image step can be done in parallel.
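The two-level structure can be sketched in a few lines. To keep the example short, each image is summarized by a single count and mean vector rather than an EM-fitted GMM, so this is a stand-in for the hierarchy idea and not the mixture-hierarchy algorithm itself. The function names `summarize` and `merge` and the toy data are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def summarize(image_features):
    """Lower level: summarize one image's feature vectors as (count, mean vector)."""
    n = len(image_features)
    d = len(image_features[0])
    mean = [sum(f[i] for f in image_features) / n for i in range(d)]
    return n, mean


def merge(summaries):
    """Top level: combine per-image summaries into one count-weighted mean."""
    total = sum(n for n, _ in summaries)
    d = len(summaries[0][1])
    mean = [sum(n * m[i] for n, m in summaries) / total for i in range(d)]
    return total, mean


# Three toy images with 2, 1, and 3 two-dimensional features.
images = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[5.0, 6.0]],
    [[7.0, 8.0], [9.0, 10.0], [11.0, 12.0]],
]

# Each image is summarized independently, so the lower level parallelizes.
with ThreadPoolExecutor() as pool:
    summaries = list(pool.map(summarize, images))

total, class_mean = merge(summaries)
```

Because the top level only consumes summaries, a new image can be folded in later by summarizing it and merging again, without revisiting the earlier images.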
Finding a sparse basis set
Hierarchical Gaussian mixtures as a density estimate
• Small-sample bias is large
• Non-convex / sensitive to initialization
• Extensive computational process to bring hierarchies together
• Each level requires supervision (#classes, initialization, etc.)
Think discriminatively
• Instead of: generating centroids that represent images
• Think: prune features to eliminate redundancy
Sparsity optimization
• Solving directly for the features that we want to use
• Reduction of redundancy is intuitive and not generative
• Under normalization, the GMM classifier can be implemented with a matched filter instead
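The matched filter claim can be checked with a short derivation. It assumes unit-norm features and centroids, and additionally assumes equal mixture weights and a shared isotropic covariance $\sigma^2 I$; those last two assumptions are not stated on the slide. Here $x$ is a query feature and $\mu_k$ is the $k$th centroid.

```latex
\begin{aligned}
\hat{k} &= \arg\max_k \; \mathcal{N}(x;\, \mu_k, \sigma^2 I)
         = \arg\min_k \; \lVert x - \mu_k \rVert_2^2 \\
\lVert x - \mu_k \rVert_2^2
        &= \lVert x \rVert_2^2 + \lVert \mu_k \rVert_2^2 - 2\,\mu_k^{\top} x
         = 2 - 2\,\mu_k^{\top} x \\
\Rightarrow \quad \hat{k} &= \arg\max_k \; \mu_k^{\top} x
\end{aligned}
```

Under these assumptions, picking the most likely component is the same as picking the centroid with the largest dot product, which is a bank of matched filters.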
A Note on Notation
• Let the feature x_j be the jth feature in the training set, where the italicized scalar x_ij is the ith dimension of that feature.
• Let X be a d x N matrix that represents the collection of all the features, where the jth column of X is a feature vector x_j.
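A minimal sketch of this notation in code, assuming features arrive as plain Python lists and are scaled to unit norm before stacking (the normalization step follows the matched filter discussion; the helper names are hypothetical):

```python
import math


def normalize(v):
    """Scale a vector to unit Euclidean norm."""
    norm = math.sqrt(sum(c * c for c in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [c / norm for c in v]


def build_feature_matrix(features):
    """Stack N feature vectors of length d as the columns of a d x N matrix.

    The matrix is returned as a list of d rows, so X[i][j] is the
    i-th dimension of the j-th feature.
    """
    unit = [normalize(f) for f in features]
    d = len(unit[0])
    return [[f[i] for f in unit] for i in range(d)]


features = [[3.0, 4.0], [1.0, 0.0], [0.0, 2.0]]  # N = 3 features, d = 2
X = build_feature_matrix(features)
```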
Finding sparsity with linear programming
Feature Extraction → Training / Classifier
• Gaussian Mixture Models: GMM, solved via EM (non-convex optimization problem)
• Many optimization problems induce sparsity, e.g., Group Lasso
• Matched filter constraint: max-constraint optimization, not convex
• Relaxation of constraints: LP optimization problem
LP Optimization Problem:
• Faster than G-Lasso
• Independent of dimensionality!
• Convex (unlike MF opt & GMM, EM)
• On average, scales according to N^2
[Equations: the objective and "such that ... and ..." constraints of each problem are not recoverable from the slide text.]
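Intuition
• Relies on similarity matrix concept
• Actual implementation does not include similarity matrix, but rather keeps track of beta indices
[Figure: ℓ∞-norm of the rows of X; the matrix β is shown with each row bounded by its own threshold (row 1 < t1, row 2 < t2, row 3 < t3, row 4 < t4). The accompanying "s.t." constraints are not recoverable from the slide text.]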
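The row-wise bounds t_i suggest the standard way an ℓ∞ row penalty becomes a linear program. The block below shows only that generic reformulation for a matrix β; the set $\mathcal{C}$ is a placeholder for whatever linear constraints the problem imposes, which the slide text does not preserve, so this should not be read as the exact program used in the talk.

```latex
\begin{aligned}
\min_{\beta \in \mathcal{C}} \; \sum_{i} \lVert \beta_{i\cdot} \rVert_\infty
\quad \Longleftrightarrow \quad
\min_{\beta \in \mathcal{C},\; t} \; & \sum_{i} t_i \\
\text{s.t.} \quad & -t_i \le \beta_{ij} \le t_i \quad \forall\, i, j
\end{aligned}
```

Every term is linear in $(\beta, t)$, so the problem is an LP whenever $\mathcal{C}$ is polyhedral. A row with $t_i = 0$ is entirely zero, which corresponds to discarding feature $i$.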
Nonlinear Feature Selection
• The optimization problem consists solely of dot products in a similarity function, which compares candidate prototypes with the features in the set that they are similar to.
• Nonlinearity may be introduced with a kernel function (RKHS), which induces a vector space whose mapping we do not necessarily need to know.
[Equations: the linear and kernelized "such that ... and ..." constraints are not recoverable from the slide text.]
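Because only pairwise similarities are needed, swapping the dot product for a kernel changes one function and nothing else. The sketch below builds a similarity (Gram) matrix with either choice. The RBF kernel and the `gamma` value are illustrative assumptions; the slides do not name a specific kernel.

```python
import math


def linear_kernel(a, b):
    """Plain dot product: the similarity used in the linear case."""
    return sum(x * y for x, y in zip(a, b))


def rbf_kernel(a, b, gamma=1.0):
    """Gaussian RBF kernel, one common nonlinear similarity (assumed here)."""
    sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-gamma * sq_dist)


def gram_matrix(features, kernel):
    """N x N matrix of pairwise similarities between the features."""
    return [[kernel(a, b) for b in features] for a in features]


features = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
K_linear = gram_matrix(features, linear_kernel)
K_rbf = gram_matrix(features, rbf_kernel)
```

The feature map behind `rbf_kernel` is never computed explicitly, which is the point of working in an RKHS.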
Application to Classification
• TRAINING: Feature Extraction → BEST FEATURES
• QUERY: Feature Extraction → Matching & Association → Classifying Image with Confidence
• Just a faster way to classify imagery in one-versus-all frameworks
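A minimal sketch of the query path, assuming each class keeps a small set of unit-norm filters (its "best features") and that a query image is scored by its average best filter response per class. The scoring rule, the margin-based confidence, and the toy class names and vectors are all illustrative assumptions, not the authors' stated procedure.

```python
import math


def unit(v):
    """Scale a vector to unit Euclidean norm."""
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def classify(query_features, filters_by_class):
    """Return (label, confidence, scores) for one query image.

    Each class score is the mean, over query features, of the strongest
    matched filter response for that class. Confidence is the margin
    between the best and second-best class scores.
    """
    scores = {}
    for label, filters in filters_by_class.items():
        best = [max(dot(q, f) for f in filters) for q in query_features]
        scores[label] = sum(best) / len(best)
    ranked = sorted(scores, key=scores.get, reverse=True)
    top = ranked[0]
    runner_up = scores[ranked[1]] if len(ranked) > 1 else 0.0
    return top, scores[top] - runner_up, scores


filters_by_class = {
    "class_a": [unit([1.0, 0.0]), unit([1.0, 1.0])],
    "class_b": [unit([0.0, 1.0])],
}
query = [unit([0.9, 0.1]), unit([1.0, 0.8])]
label, confidence, scores = classify(query, filters_by_class)
```

Classification cost is proportional to the number of retained filters, so pruning features during training directly shortens query time.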
LP Feature Learning versus G-Lasso
• More intuitive grouping
• Threshold learning is unnecessary
• Post-processing is unnecessary
• 5.452% more accurate in +1/-1 learning classes
Segmentation and Classification: Visual Result
[Figure: an original image shown alongside two panels of per-region decisions.]
Application to Localization
• 1400 images per dataset
• Filter reduction to 356 filters per class
• Less than a minute classification time
• Coverage of cities: entire cities (Vienna, Dubrovnik, Lubbock), portion of Cambridge (MIT-Kendall)
Summary
• Accurate modeling must occur before we have any hope of classifying images.
• Feature pruning is equivalent to Gaussian centroid determination under normalization.
• Sparse optimization enables feature pruning and matched filter creation.
• Sparse optimization contains only dot products, so optimization can occur with RKHS in the transductive setting.
Contributors and Acknowledgements
• MIT Lincoln Laboratory: Karl Ni, Nicholas Armstrong-Crews, Scott Sawyer, Nadya Bliss
• MIT: Katherine L. Bouman
• Boston University: Zachary Sun
• Northeastern University: Alexandru Vasile
• Cornell University: Noah Snavely