Image Context, Efficient Indexing, and Sense-Specific Category Models Trevor Darrell Kristen Grauman(*) Tom Yeh Kate Saenko MIT CSAIL UC Berkeley EECS & ICSI (*) UT Austin CS
Outline • Photo-based Question Answering • Tom Yeh • Efficient indexing with local image features • Kristen Grauman • Multimodal Sense Disambiguation for Visual Category Models • Kate Saenko
Photo-based Question Answering Tom Yeh John Lee Trevor Darrell MIT CSAIL UC Berkeley EECS & ICSI
Text-based QA Systems Yahoo! Answers
An easier example Current image matching and question matching technologies enable us to handle simpler photo-based QA automatically.
System architecture • Layer 1: Template-based QA (Books, Buildings, WWW), e.g. "Who is the architect?" → Frank Gehry • Layer 2: IR-based QA (Resolved Questions), e.g. "How many floors?" matched to "How many stories?" → 9 floors • Layer 3: Human-based QA (Community), e.g. "Is there any problem?" → "People are getting lost a lot." • Further example: "What labs are here?" → CSAIL
Prototype 1: Adding photos to a text-based QA system
Prototype 3: Applying photo-based QA to mobile devices
Outline • Photo-based Question Answering • Tom Yeh • Efficient indexing with local image features • Kristen Grauman • Multimodal Sense Disambiguation for Visual Category Models • Kate Saenko
Efficient Image Indexing Methods for Scene and Object Recognition Trevor Darrell UC-Berkeley EECS & ICSI Kristen Grauman University of Texas at Austin Dept. of Computer Sciences
Fast image indexing Goal: to recognize locations and objects, match queries by image content.
Fast image indexing Large and evolving image repository • Key technical challenges: • Robustness to variable viewing conditions • Queries are time-sensitive, but database is huge • Approach: develop sub-linear time search methods for “good” image representations and metrics.
Local Features • Local features provide invariance to geometric and photometric variation • Want fast correspondence-based search with local features
Local image features • Sources of variation: intra-class appearance, illumination, object pose, clutter, occlusions, viewpoint
Local image features Describe component regions or patches separately • Maximally Stable Extremal Regions [Matas et al.] • Shape context [Belongie et al.] • Superpixels [Ren et al.] • SIFT [Lowe] • Spin images [Johnson and Hebert] • Geometric Blur [Berg et al.] • Salient regions [Kadir et al.] • Harris-Affine [Schmid et al.]
Partially matching sets of features • Optimal match: O(m^3) • Greedy match: O(m^2 log m) • Pyramid match: O(m) (m = number of points) • The optimal match maximizes the total similarity of matched points. • The approximation makes large sets of features practical. [Grauman & Darrell, ICCV 2005]
Counting matches with intersection Histogram intersection
Example pyramid match Num “new” matches
Example pyramid match: pyramid match vs. optimal match
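The counting of "new" matches by histogram intersection can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes 1-D non-negative integer features, bin widths that double at each level, and level weights of 1/2^i; the function names `histogram_intersection`, `pyramid`, and `pyramid_match` are made up for this example.

```python
import numpy as np

def histogram_intersection(h1, h2):
    """Number of matches implied by two histograms: sum of bin-wise minima."""
    return int(np.minimum(h1, h2).sum())

def pyramid(points, num_levels, extent):
    """Histograms of 1-D points in [0, extent); bin width is 2**i at level i."""
    points = np.asarray(points)
    hists = []
    for i in range(num_levels):
        width = 2 ** i
        num_bins = int(np.ceil(extent / width))
        bin_index = (points // width).astype(int)
        hists.append(np.bincount(bin_index, minlength=num_bins))
    return hists

def pyramid_match(x, y, num_levels, extent):
    """Un-normalized pyramid match: weighted count of new matches per level."""
    px = pyramid(x, num_levels, extent)
    py = pyramid(y, num_levels, extent)
    score, previous = 0.0, 0
    for i in range(num_levels):
        intersection = histogram_intersection(px[i], py[i])
        new_matches = intersection - previous  # matches first formed at level i
        score += new_matches / (2 ** i)        # assumed weight: 1 / 2**i
        previous = intersection
    return score

# Two small point sets: one match at level 0, one at level 1, one at level 3.
score = pyramid_match([1, 5, 9], [1, 4, 12], num_levels=5, extent=16)
```

Each level costs one pass over the histograms, which is where the linear-time behaviour in the number of points comes from when the histograms are stored sparsely (this dense sketch does not do that).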
How to index efficiently over correspondences? [Figure: a query image is compared by approximate matching against a large database of N images; the most similar images according to local feature correspondences are ranked 1, 2, 3, ...]
Image search with matching-sensitive hash functions • Main idea: • Map point sets to a vector space in such a way that a dot product reflects partial match similarity (normalized pyramid match value). • Exploit random hyperplane properties to construct matching-sensitive hash functions. • Perform approximate similarity search on hashed examples. [Grauman & Darrell, CVPR 2007]
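Locality Sensitive Hashing (LSH) • Guarantees "approximate" nearest neighbors in sub-linear time, given appropriate hash functions. [Figure: the N database items Xi and the query Q are mapped by hash functions h with parameters r1…rk to bit strings such as 110101, 110111, 111101; only the << N items that collide with Q are searched] [Indyk and Motwani 1998, Charikar 2002]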
LSH functions for dot products The probability that a random hyperplane separates two unit vectors is related to the angle between them: Pr[sign(r·u) ≠ sign(r·v)] = θ(u, v)/π for a random hyperplane normal r, where θ(u, v) = arccos(u·v). • High dot product: unlikely to split • Lower dot product: likely to split [Goemans and Williamson 1995, Charikar 2002]
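The relation between angle and collision probability can be checked empirically. The sketch below assumes hyperplane normals drawn from a standard Gaussian and uses a made-up helper name, `hyperplane_bits`; it compares the observed fraction of equal hash bits for two unit vectors 60 degrees apart with the predicted value 1 - θ/π.

```python
import numpy as np

def hyperplane_bits(vectors, hyperplanes):
    """One bit per hyperplane: 1 if the vector lies on its non-negative side."""
    return (vectors @ hyperplanes.T >= 0).astype(np.uint8)

rng = np.random.default_rng(0)

# Two unit vectors separated by an angle of 60 degrees.
u = np.array([1.0, 0.0])
v = np.array([np.cos(np.pi / 3), np.sin(np.pi / 3)])

# 20000 random hyperplane normals (assumption: standard Gaussian entries).
normals = rng.standard_normal((20000, 2))
bits = hyperplane_bits(np.stack([u, v]), normals)

# Fraction of hyperplanes that do NOT separate u and v (hash bits equal).
empirical = float(np.mean(bits[0] == bits[1]))

# Predicted collision probability: 1 - theta / pi.
theta = np.arccos(np.clip(u @ v, -1.0, 1.0))
predicted = 1.0 - theta / np.pi
```

With this many hyperplanes the two numbers should agree to within a couple of percentage points.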
A useful property of intersection histograms: padded unary encoding
x = [1, 3, 5] → U(x) = [ 1 0 0 0 0 1 1 1 0 0 1 1 1 1 1 ]
y = [2, 0, 3] → U(y) = [ 1 1 0 0 0 0 0 0 0 0 1 1 1 0 0 ]
min(x, y) = [ 1, 0, 3 ]
U(x) · U(y) = [1+0+0+0+0+0+0+0+0+0+1+1+1+0+0] = 4 = sum of min(x, y)
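The unary-encoding property can be reproduced directly with the slide's numbers. This is a small sketch; the helper name `padded_unary` and the choice of 5 as the padding length are assumptions for this example.

```python
import numpy as np

def padded_unary(counts, max_count):
    """Encode each count c as c ones followed by (max_count - c) zeros."""
    counts = np.asarray(counts)
    # Row j compares positions 0..max_count-1 against counts[j].
    return (np.arange(max_count) < counts[:, None]).astype(int).ravel()

x = [1, 3, 5]
y = [2, 0, 3]

ux = padded_unary(x, max_count=5)
uy = padded_unary(y, max_count=5)

dot_product = int(ux @ uy)                  # dot product of the unary codes
intersection = int(np.minimum(x, y).sum())  # histogram intersection of x and y
```

Because the two quantities are equal, a similarity defined by intersections can be handled by machinery built for dot products, which is what the hashing scheme relies on.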
Pyramid match definition • Difference in intersections between successive levels = number of new matches • Pyramid match (un-normalized) can be expressed as a sum of weighted intersections
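The step from "weighted new matches" to "weighted intersections" is a regrouping of terms. The notation below is assumed for illustration, since the slide's own formula did not survive extraction: I_i is the histogram intersection at level i, w_i the weight of level i, and L the number of levels, with I_{-1} = 0 and w_L = 0.

```latex
\begin{align*}
P(X, Y) &= \sum_{i=0}^{L-1} w_i \,\bigl(I_i - I_{i-1}\bigr), \qquad I_{-1} = 0 \\
        &= \sum_{i=0}^{L-1} \bigl(w_i - w_{i+1}\bigr)\, I_i, \qquad w_L = 0 \\
        &= (w_0 - w_1)\, I_0 + (w_1 - w_2)\, I_1 + (w_2 - w_3)\, I_2 + w_3\, I_3
           \quad \text{for } L = 4 .
\end{align*}
```

For four levels the coefficients are w0-w1, w1-w2, w2-w3, and w3.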
Vector encoding of pyramids • Point set → Multi-resolution histogram → Sparse count vector → Weighted sparse count vector (level weights w0-w1, w1-w2, w2-w3, w3) → Implicit unary encoding, e.g. [11110,… 00000,… 11110,… 11110,… 11110,… 00000,… 11110,… 00000,… 11000,… 11110,… 11000,… 11000,… 11100,… 11000,… 11111]
Vector encoding of pyramids (level weights w0-w1, w1-w2, w2-w3, w3) • Dot product between embedded point sets yields pyramid match kernel value • Length of an embedded point set is equivalent to its self-similarity
Matching-sensitive hash functions [Plot: probability of collision (hash bits equal) vs. normalized pyramid match kernel value, i.e. normalized partial match similarity]
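Pyramid match hashing • Embed point sets as pyramids • Apply randomized hash functions h (parameters r1…rk) to the N database items Xi and to the query Q • Probability of collision = normalized partial match similarity • Only the << N examples that collide with Q are searched (example hash keys: 110101, 110111, 111101) • Guaranteed retrieval of approximate nearest neighbors (approximation controlled by ε) in sub-linear time.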
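The hash-then-search idea can be sketched with a toy index. This is a deliberately simplified illustration, not the method from the cited papers: it uses a single hash table with exact-bucket lookup, operates on generic vectors rather than embedded pyramids, and the class name `HyperplaneLSH` and its methods are made up for this example.

```python
from collections import defaultdict

import numpy as np

class HyperplaneLSH:
    """Toy single-table index keyed by random-hyperplane bit strings."""

    def __init__(self, dim, num_bits, seed=0):
        rng = np.random.default_rng(seed)
        # One random hyperplane normal per hash bit.
        self.hyperplanes = rng.standard_normal((num_bits, dim))
        self.buckets = defaultdict(list)
        self.vectors = []

    def _key(self, v):
        """Bit string for v: which side of each hyperplane it falls on."""
        return tuple((self.hyperplanes @ v >= 0).astype(int))

    def add(self, v):
        v = np.asarray(v, dtype=float)
        self.buckets[self._key(v)].append(len(self.vectors))
        self.vectors.append(v)

    def query(self, q):
        """Best cosine match among items in the query's bucket, or None."""
        q = np.asarray(q, dtype=float)
        candidates = self.buckets.get(self._key(q), [])
        if not candidates:
            return None

        def cosine(i):
            v = self.vectors[i]
            return (v @ q) / (np.linalg.norm(v) * np.linalg.norm(q))

        return max(candidates, key=cosine)

# Index 200 random vectors, then query with one of them.
rng = np.random.default_rng(1)
data = rng.standard_normal((200, 16))
index = HyperplaneLSH(dim=16, num_bits=8)
for row in data:
    index.add(row)
nearest = index.query(data[7])
```

Only the items sharing the query's bucket are compared exactly, so most of the database is never touched. A practical system would use several tables or search nearby keys to reduce misses.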
Indexing object images • Caltech101 data set • 101 categories, 40-800 images per class • Features: • Densely sampled • SIFT descriptor + spatial • Average m=1140 per set [Figure: query object] Data provided by Fei-Fei, Fergus, and Perona
Results: indexing object images • Query time controlled by required accuracy • e.g., search less than 2% of database examples for accuracy close to linear scan [Plot: k-NN error rate vs. epsilon (ε), ranging from slower search to faster search]
Summary • Content-based queries for location recognition demand fast search algorithms for useful image metrics. • Contributions: • Scalable matching for local representations • Sub-linear time search with matching • Recently extended to semi-supervised hash functions for learned metrics • (See Jain, Kulis, & Grauman, CVPR 2008)
Trevor Darrell trevor@eecs.berkeley.edu Kristen Grauman grauman@cs.utexas.edu • Relevant papers: • P. Jain, B. Kulis, and K. Grauman. Fast Image Search for Learned Metrics. To appear, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Anchorage, Alaska, June 2008. • K. Grauman and T. Darrell. Pyramid Match Hashing: Sub-Linear Time Indexing Over Partial Correspondences. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Minneapolis, MN, June 2007. • K. Grauman and T. Darrell. The Pyramid Match Kernel: Discriminative Classification with Sets of Image Features. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Beijing, China, October 2005.
Outline • Photo-based Question Answering • Tom Yeh • Efficient indexing with local image features • Kristen Grauman • Multimodal Sense Disambiguation for Visual Category Models • Kate Saenko
Multimodal Sense Disambiguation for Semi-Supervised Learning of Object Categories from the Web Kate Saenko Trevor Darrell MIT CSAIL UC Berkeley EECS & ICSI
Clutter and sense ambiguity • Tag-based retrieval returns a lot of clutter • One approach: bootstrap from seed image set • E.g., Fei-Fei et al., OPTIMOL • But how to get unusual appearances of a category?
Topic models for image clustering • Latent Dirichlet Allocation • Unsupervised learning of latent topic space • Distance in topic space groups together similar images
Mouse? A multimodal similarity measure can discover unusual appearances
Multiple senses • Bass: Fish? Musical Instrument? • Mouse: Computer? Animal? • Topic model allows segregation of distinct senses: • use seed data to identify inlier multimodal topics, • two possible approaches: 1) select either single best inlier topic, or 2) threshold to multiple topics • compute distance based on selected latent dimensions
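The two topic-selection options and the distance on selected latent dimensions can be sketched as follows. Everything here is an assumption made for illustration: topic proportions are taken as already inferred by some topic model, inlier topics are scored by their mean weight over the seed data, the distance is Euclidean distance to the seed mean, and the function names and toy numbers are hypothetical.

```python
import numpy as np

def select_topics(seed_topic_probs, threshold=None):
    """Pick inlier topics from seed data.

    threshold=None -> the single topic with the highest mean seed weight;
    otherwise      -> every topic whose mean seed weight is >= threshold.
    """
    mean_weight = np.asarray(seed_topic_probs).mean(axis=0)
    if threshold is None:
        return np.array([int(np.argmax(mean_weight))])
    return np.flatnonzero(mean_weight >= threshold)

def distance_on_selected(image_topic_probs, seed_topic_probs, selected):
    """Euclidean distance to the seed mean, using only the selected topics."""
    seed_mean = np.asarray(seed_topic_probs).mean(axis=0)
    diff = np.asarray(image_topic_probs)[:, selected] - seed_mean[selected]
    return np.linalg.norm(diff, axis=1)

# Toy topic proportions over 3 topics (rows sum to 1).
seed = [[0.8, 0.1, 0.1],
        [0.7, 0.2, 0.1]]
images = [[0.9, 0.05, 0.05],
          [0.1, 0.8, 0.1],
          [0.7, 0.2, 0.1]]

best_topic = select_topics(seed)                    # option 1: single best topic
several_topics = select_topics(seed, threshold=0.12)  # option 2: thresholding
distances = distance_on_selected(images, seed, best_topic)
ranking = np.argsort(distances)                     # closest image first
```

Images whose weight on the selected topics resembles the seed data rank first, while images dominated by a different sense (here the second image) fall to the end.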