Andrea Frome, Yoram Singer, Fei Sha, Jitendra Malik

Learning Globally-Consistent Local Distance Functions for Shape-Based Image Retrieval and Classification

Goal

Nearest neighbor classification using a distance function $D(\cdot, \cdot)$ between pairs of images

Learning a Distance Metric from Relative Comparisons [Schultz & Joachims, NIPS ’03]
• Relative comparisons between pairs: $D(x, y)$ compared with $D(x, z)$
• $D(x, y) = (x - y)^T (x - y)$

Approach
• Three images: i, j, and k
• $d_{ji,m}$: elementary distance from feature m of image j to image i
• $D_{ji} = \sum_m w_{j,m} d_{ji,m}$
• Goal: $D_{ji} < D_{ki}$
• Core question: how to set the weights $w_{j,m}$ for image j?

Derivations
• Notation
• Large-margin formulation
• Dual problem
• Solution

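Notation for triplet (i, j, k)
• $D_{ji} = \sum_m w_{j,m} d_{ji,m} = w_j \cdot d_{ji}$
• $D_{ki} > D_{ji}$, i.e. $w_k \cdot d_{ki} > w_j \cdot d_{ji}$
• With a margin: $w_k \cdot d_{ki} - w_j \cdot d_{ji} \geq 1$
• $W = [\, w_1 \; w_2 \; \ldots \; w_k \; \ldots \; w_j \; \ldots \,]$
• $X_{ijk} = [\, 0 \; 0 \; \ldots \; d_{ki} \; \ldots \; -d_{ji} \; \ldots \,]$
• Hence $W \cdot X_{ijk} \geq 1$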
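The notation above can be made concrete with a small sketch. The numbers, the dictionary layout, and the helper names `weighted_distance` and `build_triplet_vector` are invented for illustration and are not from the slides; the point is only that stacking the per-image weights into W and placing d_ki and -d_ji in the matching slots of X_ijk turns the pairwise constraint into a single dot product.

```python
def weighted_distance(w, d):
    """D = w · d: weighted sum of elementary feature distances."""
    return sum(w_m * d_m for w_m, d_m in zip(w, d))


def build_triplet_vector(offsets, total_len, j, k, d_ji, d_ki):
    """Build X_ijk: zeros except +d_ki in image k's slot and -d_ji in image j's slot."""
    x = [0.0] * total_len
    for m, value in enumerate(d_ki):
        x[offsets[k] + m] = value
    for m, value in enumerate(d_ji):
        x[offsets[j] + m] = -value
    return x


# Toy data (made up): image j has 2 features, image k has 3.
weights = {"j": [0.5, 1.0], "k": [1.0, 0.5, 2.0]}
d_ji = [0.2, 0.4]        # elementary distances from j's features to image i
d_ki = [1.0, 2.0, 0.5]   # elementary distances from k's features to image i

# Concatenate the per-image weight vectors into W and remember each offset.
offsets, W = {}, []
for name in ("j", "k"):
    offsets[name] = len(W)
    W.extend(weights[name])

X_ijk = build_triplet_vector(offsets, len(W), "j", "k", d_ji, d_ki)
D_ji = weighted_distance(weights["j"], d_ji)
D_ki = weighted_distance(weights["k"], d_ki)
margin = weighted_distance(W, X_ijk)   # W · X_ijk, which equals D_ki - D_ji

print(D_ji, D_ki, margin, margin >= 1)
```

Because X_ijk is zero outside the slots of images j and k, the dot product only ever touches the weights of those two images.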

Large-margin formulation

SVM

Soft-margin SVM

Derivation

Dual
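The equations for these slides did not survive in the transcript. As a hedged reconstruction, the standard soft-margin problem built on the triplet constraint $W \cdot X_{ijk} \geq 1$ and its Lagrangian dual would read as follows. The slack variables $\xi_{ijk}$ and the multipliers $\alpha_{ijk}$, $\beta_{ijk}$ are notation assumed here, and any additional constraint the authors place on $W$ (for example non-negativity) would add a further multiplier term that is not shown.

```latex
% Primal (soft margin), one slack variable per triplet
\min_{W,\,\xi}\ \tfrac{1}{2}\lVert W\rVert^{2} + C\sum_{ijk}\xi_{ijk}
\quad\text{s.t.}\quad W\cdot X_{ijk} \ge 1-\xi_{ijk},\qquad \xi_{ijk}\ge 0

% Lagrangian with multipliers \alpha_{ijk}\ge 0 and \beta_{ijk}\ge 0
L = \tfrac{1}{2}\lVert W\rVert^{2} + C\sum_{ijk}\xi_{ijk}
  - \sum_{ijk}\alpha_{ijk}\bigl(W\cdot X_{ijk}-1+\xi_{ijk}\bigr)
  - \sum_{ijk}\beta_{ijk}\,\xi_{ijk}

% Stationarity
\frac{\partial L}{\partial W}=0 \ \Rightarrow\ W=\sum_{ijk}\alpha_{ijk}X_{ijk},
\qquad
\frac{\partial L}{\partial \xi_{ijk}}=0 \ \Rightarrow\ C-\alpha_{ijk}-\beta_{ijk}=0

% Dual
\max_{\alpha}\ \sum_{ijk}\alpha_{ijk}
  - \tfrac{1}{2}\Bigl\lVert \sum_{ijk}\alpha_{ijk}X_{ijk}\Bigr\rVert^{2}
\quad\text{s.t.}\quad 0\le\alpha_{ijk}\le C
```

The box constraint on $\alpha_{ijk}$ follows from $\beta_{ijk} = C - \alpha_{ijk} \geq 0$, which is where the trade-off parameter C enters the dual.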

Details: Features and descriptors
• Find ~400 features per image
• Compute geometric blur descriptor

Descriptors • Geometric blur

Descriptors
• Two sizes of geometric blur (42 pixels and 70 pixels)
• Each is 204 dimensions (4 orientations and 51 samples each)
• HSV histograms of 42-pixel patches

Choosing triplets
• Caltech101, at 15 images per class
  • 31.8 million triplets
  • Many are easy to satisfy
• For each image j, for each feature
  • Find the N images I with closest features
  • For each negative example i in I, form triplets (j, k, i)
• Eliminates ~half of triplets
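The pruning rule above can be sketched as follows. This is an illustration under assumptions, not the authors' code: the function name `select_triplets`, the `elem_dist` layout, and the toy labels are made up, and the sketch assumes that k ranges over all images sharing the focal image's class.

```python
def select_triplets(focal, labels, elem_dist, n_closest):
    """Return the (focal, positive, negative) triplets kept for one focal image.

    elem_dist[m][i] is the elementary distance from feature m of the focal
    image to image i (hypothetical layout, chosen for this sketch).
    """
    positives = [i for i in labels if i != focal and labels[i] == labels[focal]]
    kept = set()
    for dist_to_images in elem_dist:
        # The N images whose features lie closest to this focal feature.
        closest = sorted(dist_to_images, key=dist_to_images.get)[:n_closest]
        for i in closest:
            if labels[i] != labels[focal]:   # negative example among the closest
                for k in positives:          # assumption: pair with every positive
                    kept.add((focal, k, i))
    return kept


labels = {"a": "cat", "b": "cat", "c": "dog", "d": "dog", "e": "car"}
elem_dist = [
    {"b": 0.1, "c": 0.2, "d": 0.9, "e": 0.8},   # feature 0 of focal image "a"
    {"b": 0.3, "c": 0.4, "d": 0.7, "e": 0.9},   # feature 1 of focal image "a"
]
triplets = select_triplets("a", labels, elem_dist, n_closest=2)
print(sorted(triplets))
```

Negatives that never appear among the N closest images for any feature are already far away, so their triplets are the easy ones that get dropped.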

Choosing C
• Train with multiple values of C, testing on a held-out part of the training set
• Choose whichever gives the best results
• For each C, run online version of the training algorithm
  • Make one sweep through training triplets
  • For each misclassified triplet (i,j,k), update weights for the three images
• Choose C which gets the most right answers
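A rough sketch of this selection loop is given below. The slides do not state the update rule, so the sketch swaps in a passive-aggressive style step capped at C and clips weights at zero; both choices are assumptions, as are the function names and the toy triplet vectors. It updates only the weight entries that a triplet vector touches.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def one_sweep(triplet_vectors, dim, C):
    """One online pass: repair each violated constraint W · X >= 1 with a step capped at C."""
    W = [0.0] * dim
    for X in triplet_vectors:
        loss = 1.0 - dot(W, X)
        norm_sq = dot(X, X)
        if loss > 0 and norm_sq > 0:
            tau = min(C, loss / norm_sq)   # assumed passive-aggressive step size
            # Step toward satisfying the constraint, then clip weights at zero.
            W = [max(0.0, w + tau * x) for w, x in zip(W, X)]
    return W


def count_correct(W, triplet_vectors):
    """Number of triplets ordered correctly, i.e. with W · X > 0."""
    return sum(1 for X in triplet_vectors if dot(W, X) > 0)


def choose_C(train, held_out, dim, candidates):
    """Return the candidate C with the most correct held-out triplets, plus all scores."""
    scores = {C: count_correct(one_sweep(train, dim, C), held_out) for C in candidates}
    return max(candidates, key=lambda C: scores[C]), scores


# Toy triplet vectors (made up), in the same layout as X_ijk.
train = [[-0.2, -0.4, 1.0, 2.0, 0.5], [-0.5, -0.1, 0.3, 0.1, 0.4]]
held_out = [[-0.3, -0.2, 0.8, 1.5, 0.2], [-0.9, -0.8, 0.1, 0.0, 0.1]]
best_C, scores = choose_C(train, held_out, dim=5, candidates=[0.01, 0.1, 1.0])
print(best_C, scores)
```

A single sweep per candidate keeps the search cheap compared with solving the full problem for every value of C.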

Results
• At 15 training examples per class: 63.2% (~3% improvement)
• At 20 training examples per class: 66.6% (~5% improvement)

Results
• Confusion matrix
• Hardest categories: crocodile, cougar_body, cannon, bass

Questions
• Is there any disadvantage to a non-metric distance function?
  • Could the images be embedded in a metric space?
• Why not learn everything?
  • Include a feature for each image pixel
  • Include multiple types of descriptors
• Could this be used to do unsupervised learning for sets of tagged images (e.g., for image segmentation)?
• Can you learn a single distance per class?
