Multi-Class Object Localization by Combining Local Contextual Interactions

Multi-Class Object Localization by Combining Local Contextual Interactions • Carolina Galleguillos, Brian McFee, Serge Belongie, Gert Lanckriet • Computer Science and Engineering Department, Electrical and Computer Engineering Department, University of California, San Diego

Outline • Introduction • Multi-Class Multi-Kernel Approach • Contextual Interaction • Experiment & Results • Conclusion

Introduction • Contextual cues can greatly improve object localization accuracy over models that use appearance features alone. • Context considers information from the area neighboring an object, such as pixel, region, and object interactions.

Introduction • In this work, we present a novel framework for object localization that efficiently and effectively combines different levels of interaction. • We develop a multiple kernel learning algorithm to integrate appearance features with pixel and region interaction data, resulting in a unified similarity metric that is optimized for nearest neighbor classification. • Object level interactions are modeled by a conditional random field (CRF) to produce the final label prediction.

Multi-Class Multi-Kernel Approach • Large Margin Nearest Neighbor • Multiple Kernel Extension • Spatial Smoothing by Segment Merging • Contextual Conditional Random Field

Multi-Class Multi-Kernel Approach • In our model, each training image I is partitioned into segments $s_i$ using ground-truth information. • Each segment $s_i$ corresponds to exactly one object of class $c_i \in C$, where $C$ is the set of all object labels. • These segments are collected into the training set $S$. • For each segment $s_i \in S$, we extract several types of features, where the $p$th feature space is characterized by a kernel function $k^p(\cdot, \cdot)$ and inner product matrix $K^p$, with $K^p_{ij} = k^p(s_i, s_j)$.
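
The inner product matrix for one feature space can be sketched as follows. This is a minimal illustration, not the authors' code: the function names `linear_kernel` and `gram_matrix` are hypothetical, and a plain linear kernel stands in for whichever kernel is actually used for each feature type.

```python
from typing import Callable, List, Sequence

Vector = Sequence[float]


def linear_kernel(x: Vector, y: Vector) -> float:
    """Stand-in kernel (assumption): the ordinary dot product."""
    return sum(a * b for a, b in zip(x, y))


def gram_matrix(
    features: Sequence[Vector],
    kernel: Callable[[Vector, Vector], float],
) -> List[List[float]]:
    """Build K with K[i][j] = kernel(features[i], features[j])."""
    return [[kernel(x, y) for y in features] for x in features]
```

Calling `gram_matrix` once per feature type would give one matrix per feature space, which is the collection of kernels the next slide refers to.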

Multi-Class Multi-Kernel Approach • From this collection of kernels, we learn a unified similarity metric over $S$ and a corresponding embedding function $g$ that maps the training set into the learned space. • To provide more representative examples for nearest neighbor prediction, we augment the training set $S$ with additional segments, obtained by running a segmentation algorithm multiple times on the training images [24]. • Because ground-truth segmentations are not available at test time, the test image must be segmented automatically.

Multi-Class Multi-Kernel Approach • Multiple Kernel Extension: several different features are extracted; segments are mapped into a unified space and a soft label prediction is computed. • Spatial Smoothing by Segment Merging. • Contextual Conditional Random Field: predicts the final labeling of each segment.

Large Margin Nearest Neighbor • Our classification algorithm is based on k-nearest neighbor prediction. • We apply the Large Margin Nearest Neighbor (LMNN) algorithm to optimally distort the features for nearest neighbor prediction [35]. • Neighbors are selected using the learned Mahalanobis distance metric $W$: $d_W(x_i, x_j) = (x_i - x_j)^\top W (x_i - x_j)$, where $x_i$ is the feature vector of segment $s_i$. • $W$ is a positive semidefinite (PSD) matrix.
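
Neighbor selection under a Mahalanobis metric can be sketched in a few lines. This is an illustration under assumptions: the names `mahalanobis_sq` and `k_nearest` are hypothetical, `W` is taken as already learned, and vectors are plain Python lists.

```python
from typing import List, Sequence

Vector = Sequence[float]
Matrix = Sequence[Sequence[float]]


def mahalanobis_sq(x: Vector, y: Vector, W: Matrix) -> float:
    """Squared Mahalanobis distance (x - y)^T W (x - y) for a PSD matrix W."""
    diff = [a - b for a, b in zip(x, y)]
    # Compute W @ diff, then take the dot product with diff.
    w_diff = [sum(w * d for w, d in zip(row, diff)) for row in W]
    return sum(d * wd for d, wd in zip(diff, w_diff))


def k_nearest(query: Vector, points: List[Vector], W: Matrix, k: int) -> List[int]:
    """Indices of the k points closest to `query` under the metric W."""
    order = sorted(
        range(len(points)),
        key=lambda i: mahalanobis_sq(query, points[i], W),
    )
    return order[:k]
```

With `W` equal to the identity this reduces to ordinary Euclidean k-nearest neighbors; a learned `W` stretches or shrinks directions of the feature space before neighbors are ranked.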

Large Margin Nearest Neighbor • $W$ is trained so that, for each training segment, neighboring segments (in feature space) with differing labels are pushed away by a large margin. • This is achieved by solving a semidefinite program defined over each segment's neighbors with similar and dissimilar labels, using a slack trade-off parameter and slack variables.
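
For orientation, the general LMNN semidefinite program can be sketched as below. The symbol names are assumptions chosen for this sketch and may differ from the slide's notation: $d_W(x_i, x_j) = (x_i - x_j)^\top W (x_i - x_j)$ is the learned distance, $i^+$ and $i^-$ are the neighbors of segment $s_i$ with similar and dissimilar labels, $\beta > 0$ is the slack trade-off parameter, and $\xi_{ijl}$ are the slack variables.

```latex
\begin{aligned}
\min_{W,\,\xi}\quad & \sum_{i}\sum_{j \in i^+} d_W(x_i, x_j)
  \;+\; \beta \sum_{i}\sum_{j \in i^+}\sum_{l \in i^-} \xi_{ijl} \\
\text{s.t.}\quad & d_W(x_i, x_l) - d_W(x_i, x_j) \;\ge\; 1 - \xi_{ijl}, \\
& \xi_{ijl} \ge 0, \qquad W \succeq 0 .
\end{aligned}
```

The first term pulls same-label neighbors together; each constraint asks a differently labeled neighbor $l$ to be at least one unit farther from $i$ than a same-label neighbor $j$, with $\xi_{ijl}$ absorbing violations.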

Large Margin Nearest Neighbor • A linear projection matrix $L$ can be recovered from $W$ by its spectral decomposition, so that $W = L^\top L$: with $W = V \Lambda V^\top$, where $V$ contains the eigenvectors of $W$ and $\Lambda$ is a diagonal matrix containing the eigenvalues, we have $L = \Lambda^{1/2} V^\top$.
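
The following short derivation shows why recovering $L$ is useful: distances under $W$ become ordinary Euclidean distances after projecting by $L$. The notation $x_i$ for the feature vector of a segment and $\Lambda$ for the eigenvalue matrix is assumed here.

```latex
\begin{aligned}
W &= V \Lambda V^\top, \qquad L = \Lambda^{1/2} V^\top
  \;\;\Rightarrow\;\; L^\top L = V \Lambda^{1/2} \Lambda^{1/2} V^\top = W, \\
(x_i - x_j)^\top W (x_i - x_j)
  &= (x_i - x_j)^\top L^\top L \,(x_i - x_j)
  = \lVert L x_i - L x_j \rVert^2 .
\end{aligned}
```

The square root $\Lambda^{1/2}$ is real because $W$ is positive semidefinite, so all of its eigenvalues are nonnegative.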

Large Margin Nearest Neighbor • Although the learned projection is linear, the algorithm can be kernelized [28] to effectively learn non-linear feature transformations. • After kernelizing the algorithm, each segment $s_i$ can be represented by its corresponding column $K_i$ of the kernel matrix $K$, and a regularization term $\mathrm{tr}(WK)$ is introduced. • The embedding function then takes the form $g(s_i) = L K_i$.

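Multiple Kernel Extension • To effectively integrate different types of feature descriptions, we learn a linear projection from each kernel's feature space. • We define the combined distance between two points by summing the distances in each (transformed) space. This is expressed algebraically as $d(s_i, s_j) = \sum_p (K^p_i - K^p_j)^\top W^p (K^p_i - K^p_j)$. • The regularization term $\mathrm{tr}(WK)$ is similarly extended to the sum $\sum_p \mathrm{tr}(W^p K^p)$. • The multiple-kernel embedding function then takes the form $g(s_i) = [L^p K^p_i]_{p=1}^{m}$, the concatenation of the per-kernel projections, where $m$ is the number of kernels.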
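
Summing the distance over feature spaces can be sketched as follows. This is a minimal illustration with hypothetical names (`quad_form`, `combined_distance`); it assumes each segment is given as one kernel-matrix column per feature space and that one PSD matrix per feature space has already been learned.

```python
from typing import Sequence

Vector = Sequence[float]
Matrix = Sequence[Sequence[float]]


def quad_form(diff: Vector, W: Matrix) -> float:
    """Return diff^T W diff."""
    return sum(
        d_r * sum(w * d_c for w, d_c in zip(row, diff))
        for d_r, row in zip(diff, W)
    )


def combined_distance(
    cols_i: Sequence[Vector],
    cols_j: Sequence[Vector],
    metrics: Sequence[Matrix],
) -> float:
    """Sum of per-kernel squared distances between two segments.

    cols_i[p] and cols_j[p] are the kernel-matrix columns of the two
    segments in feature space p; metrics[p] is the matrix for that space.
    """
    total = 0.0
    for col_i, col_j, W in zip(cols_i, cols_j, metrics):
        diff = [a - b for a, b in zip(col_i, col_j)]
        total += quad_form(diff, W)
    return total
```

Because the total is a plain sum, a feature space whose matrix has small entries contributes little to the final distance, which is how the learned matrices end up weighting the feature types against each other.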

Multiple Kernel Extension • Multiple Kernel LMNN (MKLMNN) algorithm:

Multiple Kernel Extension • The probability distribution over the labels for a test segment $s'$ is computed from its k nearest neighbors, weighted according to their distance from $g(s')$, where $c_i$ is the label of segment $s_i$. • To simplify the process, we restrict each $W^p$ to be diagonal, which can be interpreted as learning weightings over $S$ in each feature space.
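
A distance-weighted soft label prediction can be sketched as below. The weighting function is an assumption: each neighbor votes with weight `exp(-distance)`, which is one common choice and not necessarily the exact form used in the talk. The function name `soft_labels` is hypothetical.

```python
import math
from collections import defaultdict
from typing import Dict, Hashable, Sequence, Tuple


def soft_labels(neighbors: Sequence[Tuple[float, Hashable]]) -> Dict[Hashable, float]:
    """Label distribution from (distance, label) pairs of the k nearest neighbors.

    Assumption: each neighbor votes for its label with weight exp(-distance),
    and the votes are normalized to sum to one.
    """
    weights: Dict[Hashable, float] = defaultdict(float)
    for dist, label in neighbors:
        weights[label] += math.exp(-dist)
    total = sum(weights.values())
    return {label: w / total for label, w in weights.items()}
```

The distances passed in would be the distances in the learned embedding between the test segment and each of its k nearest training segments.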

Spatial Smoothing by Segment Merging • Because objects may be represented by multiple segments at test time, some of those segments will contain only partial information about the object, resulting in less reliable label predictions. • We smooth a segment's label distribution by incorporating information from segments that are likely to come from the same object, resulting in an updated label distribution.

Spatial Smoothing by Segment Merging • Using the extra segments, we train an SVM classifier to predict when two segments belong to the same object. • From the ground-truth object annotations, we know when a pair of training segments came from the same object. • Given two segments $s_i$ and $s_j$, we compute: • pixel and region interaction features. • overlap between segment masks. • normalized segment centroids. • number of segments obtained in the segmentation. • Euclidean distance between the two segment centroids.

Spatial Smoothing by Segment Merging • We construct an undirected graph where each vertex is a segment, and edges are added between pairs that the classifier predicts should be merged; each connected component yields a new object segment. • The smoothed label distribution is the geometric mean of the segment's distribution and its corresponding object's distribution.
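
The merging and smoothing steps can be sketched as follows. Several details are assumptions made for illustration: merged objects are taken to be the connected components of the graph, the object's label distribution is supplied as an input (how it is computed is not shown here), and the geometric mean is renormalized to sum to one. The function names are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def connected_components(n: int, edges: Sequence[Tuple[int, int]]) -> List[int]:
    """Union-find: return a component id for each of n segments."""
    parent = list(range(n))

    def find(a: int) -> int:
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a

    for a, b in edges:
        parent[find(a)] = find(b)
    return [find(a) for a in range(n)]


def smooth(seg_dist: Dict[str, float], obj_dist: Dict[str, float]) -> Dict[str, float]:
    """Renormalized geometric mean of segment and object label distributions."""
    raw = {c: math.sqrt(p * obj_dist.get(c, 0.0)) for c, p in seg_dist.items()}
    total = sum(raw.values())
    if total == 0.0:
        # The two distributions share no label: keep the segment's own prediction.
        return dict(seg_dist)
    return {c: v / total for c, v in raw.items()}
```

A geometric mean lets a confident object-level prediction sharpen an uncertain segment, while a label that either distribution rules out stays at zero.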

Contextual Conditional Random Field • Pixel and region interactions can be described by low-level features, but object interactions require a high-level description of each object, e.g., its label. • We follow the soft label prediction with a conditional random field (CRF) that encodes high-level object interactions.

Contextual Conditional Random Field • We learn potential functions from object co-occurrences, capturing long-distance dependencies between whole regions of the image and across classes. • Our CRF model treats the image as a bag of segments, with a label vector holding one label for each segment in the image. • The final label vector is the one that maximizes the CRF probability.
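
Choosing the label vector that maximizes a CRF score can be sketched by exhaustive search. The functional form here is an assumption made for illustration, not the model from the talk: the score is the sum of log soft-label probabilities plus a pairwise co-occurrence log-potential for every pair of segments. The function name `map_labeling` is hypothetical, and exhaustive search is only practical for a handful of segments.

```python
import itertools
import math
from typing import Dict, Sequence, Tuple


def map_labeling(
    unary: Sequence[Dict[str, float]],
    pairwise: Dict[Tuple[str, str], float],
) -> Tuple[str, ...]:
    """Exhaustively find the labeling with the highest log-score.

    unary[i][c] is the (strictly positive) probability of label c for
    segment i; every segment must use the same label set.
    pairwise[(a, b)] is a symmetric log-potential; missing pairs score 0.
    """
    labels = sorted(unary[0])
    best, best_score = None, -math.inf
    for assignment in itertools.product(labels, repeat=len(unary)):
        score = sum(math.log(unary[i][c]) for i, c in enumerate(assignment))
        for a, b in itertools.combinations(assignment, 2):
            score += pairwise.get((a, b), pairwise.get((b, a), 0.0))
        if score > best_score:
            best, best_score = assignment, score
    return best
```

In the small example used to check this sketch, a segment that looks slightly more like a cow than a boat is relabeled as a boat once the potential rewards boat and water appearing together.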

Contextual Interactions • In this part, we describe the features we use to characterize each level of contextual interaction. • Pixel level interaction. • Region level interaction. • Object level interaction.

Pixel Level Interaction • Pixel level interactions can implicitly capture background contextual information as well as information about object boundaries. • We use a new type of contextual source, boundary support.

Pixel Level Interaction • Boundary support is encoded by computing a histogram over the LAB color values of pixels lying between 0 and a fixed number of pixels away from the object's boundary. • We compute the $\chi^2$-distance between boundary support histograms $H$. • The pixel interaction kernel is defined from this distance.
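
A histogram distance and a kernel built from it can be sketched as below. Both conventions are assumptions: the chi-squared distance is written with the common factor of 1/2, and the kernel is taken to be the exponentiated form `exp(-gamma * distance)`, with `gamma` a hypothetical bandwidth parameter. The function names are also hypothetical.

```python
import math
from typing import Sequence


def chi2_distance(h1: Sequence[float], h2: Sequence[float]) -> float:
    """Chi-squared distance between two histograms of equal length."""
    total = 0.0
    for a, b in zip(h1, h2):
        if a + b > 0:  # skip bins that are empty in both histograms
            total += (a - b) ** 2 / (a + b)
    return 0.5 * total


def chi2_kernel(h1: Sequence[float], h2: Sequence[float], gamma: float = 1.0) -> float:
    """Assumed kernel form: exp(-gamma * chi2_distance(h1, h2))."""
    return math.exp(-gamma * chi2_distance(h1, h2))
```

Identical histograms give a kernel value of 1, and the value decays toward 0 as the color distributions around the two object boundaries diverge.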

Region Level Interaction • By using large windows around an object, known as contextual neighborhoods [7], regions encode probable geometrical configurations, and capture information from neighboring (parts of) objects.

Region Level Interaction • The contextual neighborhood is computed by dilating the bounding box around the object with a disk of a given diameter. • We model region interactions by computing the gist [31] of a contextual neighborhood, $G_i$. • Our region interactions are represented by a kernel over these gist descriptors.

Object Level Interactions • To train the object interaction CRF, we derive semantic context from the co-occurrence of objects within each training image. • This is summarized in a co-occurrence matrix A. • A(i,j) counts the times an object with label $c_i$ appears in a training image with an object with label $c_j$. • Diagonal entries correspond to the frequency of the object in the training set.
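
Building the co-occurrence matrix from per-image label lists can be sketched as follows. One counting convention is assumed: each class is counted at most once per image, so a diagonal entry is the number of training images containing that class. The function name `cooccurrence` is hypothetical.

```python
from typing import List, Sequence


def cooccurrence(images: Sequence[Sequence[str]], labels: Sequence[str]) -> List[List[int]]:
    """A[i][j] = number of images containing both labels[i] and labels[j]."""
    index = {c: k for k, c in enumerate(labels)}
    A = [[0] * len(labels) for _ in labels]
    for image_labels in images:
        # Deduplicate so each class is counted once per image (assumption).
        present = sorted({index[c] for c in image_labels})
        for i in present:
            for j in present:
                A[i][j] += 1
    return A
```

The resulting matrix is symmetric, and its off-diagonal entries are the raw counts from which pairwise potentials between object labels could be derived.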

Experiments • Datasets: MSRC and PASCAL 2007 • Appearance features: SIFT, Self-similarity (SSIM), LAB histogram, Pyramid of Histogram of Oriented Gradients (PHOG) • Context features: GIST, LAB color

Result • Object localization: • Mean accuracy results

Result MSRC presents more co-occurrences of object classes per image than PASCAL, providing more information to the object interaction model.

Result • Feature combination: • Learning the optimal embedding

Result • Learned kernel weights

Result • Comparison to other models: • MSRC • PASCAL 07

Conclusion • We have introduced a novel framework that efficiently and effectively combines different levels of local context interactions. • Our multiple kernel learning algorithm integrates appearance features with pixel and region interaction data. • We obtain significant improvement over current state-of-the-art contextual frameworks. • By adding another object interaction type, such as spatial context [8], localization accuracy could be improved further.

Thank you!!!
