
Reading Between the Lines: Object Localization Using Implicit Cues from Image Tags




Presentation Transcript


  1. Reading Between the Lines: Object Localization Using Implicit Cues from Image Tags. Sung Ju Hwang and Kristen Grauman, University of Texas at Austin. CVPR 2010.

  2. Detecting tagged objects. Images tagged with keywords clearly tell us which objects to search for. (Example image tags: Dog, Black lab, Jasper, Sofa, Self, Living room, Fedora, Explore #24.)

  3. Detecting tagged objects. Previous work using tagged images focuses on the noun ↔ object correspondence: Duygulu et al. 2002; Berg et al. 2004; Fergus et al. 2005; Li et al. 2009 [also Lavrenko et al. 2003, Monay & Gatica-Perez 2003, Barnard et al. 2004, Schroff et al. 2007, Gupta & Davis 2008, Vijayanarasimhan & Grauman 2008, …].

  4. Our Idea. The list of human-provided tags gives useful cues beyond just which objects are present. Compare two tag lists: (1) Computer, Poster, Desk, Bookshelf, Screen, Keyboard, Screen, Mug, Poster; (2) Mug, Key, Keyboard, Toothbrush, Pen, Photo, Post-it. Based on the tags alone, can you guess where and what size the mug will be in each image?

  5. Our Idea. The list of human-provided tags gives useful cues beyond just which objects are present. In list (1), the mug is named later and larger objects (computer, desk, bookshelf) are present; in list (2), the mug is named first and larger objects are absent.

  6. Our Idea. We propose to learn the implicit localization cues provided by tag lists to improve object detection.

  7–8. Approach overview. Training: learn an object-specific connection between localization parameters and implicit tag features, P(location, scale | tags). (Training images with tag lists such as: Computer, Poster, Desk, Screen, Mug, Poster; Desk, Mug, Office; Mug, Eiffel; Mug, Coffee, Woman, Table; Mug, Ladder.) Testing: given a novel image (e.g., tagged Mug, Key, Keyboard, Toothbrush, Pen, Photo, Post-it), localize objects based on both tags (implicit tag features) and appearance (object detector).

  9–10. Feature: Word presence/absence. The presence or absence of other objects affects the scene layout → record bag-of-words frequency: W = [w_1, …, w_N], where w_i = the count of the i-th word. Compare the tag lists Mug, Key, Keyboard, Toothbrush, Pen, Photo, Post-it (only small objects mentioned) and Computer, Poster, Desk, Bookshelf, Screen, Keyboard, Screen, Mug, Poster (large objects mentioned).
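As a concrete illustration of this feature, here is a minimal Python sketch of the bag-of-words construction; the vocabulary, function name, and example tags are illustrative rather than the paper's actual implementation.

```python
import numpy as np

def words_feature(tags, vocabulary):
    """Bag-of-words vector W = [w_1, ..., w_N], where w_i counts how
    often the i-th vocabulary word appears in the image's tag list."""
    index = {word: i for i, word in enumerate(vocabulary)}
    W = np.zeros(len(vocabulary))
    for tag in tags:
        if tag in index:  # ignore out-of-vocabulary tags
            W[index[tag]] += 1
    return W

vocab = ["mug", "key", "keyboard", "computer", "desk", "poster", "screen"]
print(words_feature(["computer", "poster", "desk", "screen", "mug", "poster"], vocab))
# -> [1. 0. 0. 1. 1. 2. 1.]
```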

  11–12. Feature: Rank of tags. People tag the "important" objects earlier → record the rank of each tag compared to its typical rank: R = [r_1, …, r_N], where r_i = the percentile rank of the i-th word. In the list Mug, Key, Keyboard, Toothbrush, Pen, Photo, Post-it, the mug is named first and thus has a relatively high rank; in Computer, Poster, Desk, Bookshelf, Screen, Keyboard, Screen, Mug, Poster, it is named much later.
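A sketch of one plausible reading of "rank compared to its typical rank": score each word by the fraction of its training occurrences that were tagged later than in the current list. The helper `typical_ranks` and all names are assumptions, not the paper's code.

```python
import numpy as np

def rank_feature(tags, vocabulary, typical_ranks):
    """Relative-rank vector R: for each vocabulary word in the tag list,
    the percentile of its position relative to the positions that word
    takes in training tag lists (typical_ranks: word -> list of positions)."""
    R = np.zeros(len(vocabulary))
    for pos, tag in enumerate(tags):
        if tag in typical_ranks and tag in vocabulary:
            history = np.asarray(typical_ranks[tag])
            # fraction of training occurrences where this word came later
            R[vocabulary.index(tag)] = np.mean(history > pos)
    return R

# "mug" tagged first (pos 0) scores high when it usually appears later
print(rank_feature(["mug", "key"], ["mug", "key"], {"mug": [3, 4, 2], "key": [1]}))
# -> [1. 0.]
```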

  13–14. Feature: Proximity of tags. People tend to move their eyes to nearby objects after the first fixation → record the proximity of all tag pairs: P = {p_ij}, where p_ij = the rank difference between the i-th and j-th words. Example ordered tag lists: 1) Mug, 2) Key, 3) Keyboard, 4) Toothbrush, 5) Pen, 6) Photo, 7) Post-it and 1) Computer, 2) Poster, 3) Desk, 4) Bookshelf, 5) Screen, 6) Keyboard, 7) Screen, 8) Mug, 9) Poster. Tags with a small rank difference may correspond to objects that are close to each other in the image.
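A minimal sketch of the pairwise feature, assuming p_ij is the absolute rank difference when both words appear (and 0 otherwise); when a word is tagged twice, this sketch simply keeps its last position. All names are ours.

```python
import numpy as np

def proximity_feature(tags, vocabulary):
    """Matrix P with p_ij = absolute rank difference between the i-th and
    j-th vocabulary words when both occur in the (ordered) tag list."""
    rank = {t: k + 1 for k, t in enumerate(tags)}  # 1-based ranks
    P = np.zeros((len(vocabulary), len(vocabulary)))
    for i, wi in enumerate(vocabulary):
        for j, wj in enumerate(vocabulary):
            if wi in rank and wj in rank and i != j:
                P[i, j] = abs(rank[wi] - rank[wj])
    return P

print(proximity_feature(["mug", "key", "keyboard"], ["mug", "keyboard"]))
# mug (rank 1) and keyboard (rank 3) differ by 2
```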

  15. Approach overview. Training: learn P(location, scale | W, R, P) from the implicit tag features. Testing: given a novel image, localize objects using both the implicit tag features and the object detector.

  16. Modeling P(X|T). We need a PDF for the location and scale of the target object given the tag feature: P(X = (scale, x, y) | T = tag feature). We model it directly using a mixture density network (MDN) [Bishop, 1994]: a neural network maps the input tag feature (Words, Rank, or Proximity) to the parameters of a Gaussian mixture (mixing weights α, means µ, covariances Σ).
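Below is a minimal sketch of such an MDN head in PyTorch, assuming isotropic Gaussian components; the hidden size, number of components K, and all names are our illustrative choices, not the paper's settings.

```python
import torch
import torch.nn as nn

class MDN(nn.Module):
    """Maps a tag feature T to the parameters (alpha, mu, sigma) of a
    K-component Gaussian mixture over X = (scale, x, y)."""
    def __init__(self, in_dim, hidden=64, K=8, out_dim=3):
        super().__init__()
        self.K, self.out_dim = K, out_dim
        self.body = nn.Sequential(nn.Linear(in_dim, hidden), nn.Tanh())
        self.alpha = nn.Linear(hidden, K)         # mixing weights
        self.mu = nn.Linear(hidden, K * out_dim)  # component means
        self.log_sigma = nn.Linear(hidden, K)     # isotropic spreads

    def forward(self, T):
        h = self.body(T)
        alpha = torch.softmax(self.alpha(h), dim=-1)
        mu = self.mu(h).view(-1, self.K, self.out_dim)
        sigma = torch.exp(self.log_sigma(h))
        return alpha, mu, sigma

def mdn_nll(alpha, mu, sigma, x):
    """Negative log-likelihood of targets x under the predicted mixture;
    minimizing this fits P(X|T) on training images."""
    comp = torch.distributions.Normal(mu, sigma.unsqueeze(-1))
    log_p = comp.log_prob(x.unsqueeze(1)).sum(-1)          # (batch, K)
    return -torch.logsumexp(torch.log(alpha) + log_p, -1).mean()
```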

  17–18. Modeling P(X|T). Example: the top 30 most likely localization parameters sampled for the object "car", given only the tags. (Figure: sampled boxes overlaid on example images whose tag lists include car, lamp, wheel, light, window, house, road, lightpole, boulder, windows, building, man, barrel, truck.)
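Given a trained model of P(X|T), hypotheses like the 30 shown here can be drawn by ancestral sampling: pick a mixture component, then sample from its Gaussian. This sketch reuses the hypothetical MDN above.

```python
import torch

def sample_localizations(model, T, n=30):
    """Draw n (scale, x, y) hypotheses from P(X|T) for one tag feature T."""
    alpha, mu, sigma = model(T)
    k = torch.multinomial(alpha, n, replacement=True)[0]  # choose components
    eps = torch.randn(n, mu.shape[-1])
    return mu[0, k] + sigma[0, k].unsqueeze(-1) * eps     # (n, 3) samples
```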

  19. Approach overview. Testing: given a novel image (e.g., tagged Mug, Key, Keyboard, Toothbrush, Pen, Photo, Post-it), localize objects based on both the implicit tag features and the object detector.

  20–21. Integrating with object detector. How to exploit this learned distribution P(X|T)? Use it to speed up the detection process (location priming): (a) sort all candidate windows according to P(X|T), from most likely to least likely; (b) run the detector only at the most probable locations and scales.
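A small sketch of the priming step, under the assumption that each candidate window already has a prior score from P(X|T); the budget fraction and all names are illustrative.

```python
import numpy as np

def prime_search(windows, prior_scores, budget=0.3):
    """Sort candidate windows by the tag-based prior and keep only the
    top fraction for the expensive appearance-based detector."""
    order = np.argsort(-np.asarray(prior_scores))  # most likely first
    keep = order[: max(1, int(budget * len(order)))]
    return [windows[i] for i in keep]

windows = [(1.0, 10, 20), (0.5, 40, 80), (2.0, 5, 5)]   # (scale, x, y)
print(prime_search(windows, [0.2, 0.9, 0.1], budget=0.34))
# -> [(0.5, 40, 80)]  only this window gets scored by the detector
```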

  22–23. Integrating with object detector. Use it to increase detection accuracy by modulating the detector output scores: combine the predictions based on tag features with the predictions from the object detector (on the slide, tag-based scores of 0.9, 0.3, and 0.2 combine with detector scores of 0.7, 0.8, and 0.9 to give modulated scores of 0.63, 0.24, and 0.18).
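The slide's numbers are consistent with a simple multiplicative combination, sketched below; that the scores are combined exactly this way is our assumption from those numbers.

```python
def modulate(detector_score, tag_prior):
    """Modulate the appearance detector's score by the tag-based prior."""
    return detector_score * tag_prior

# reproduces the slide's example scores
for det, tag in [(0.7, 0.9), (0.8, 0.3), (0.9, 0.2)]:
    print(round(modulate(det, tag), 2))   # 0.63, 0.24, 0.18
```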

  24. Experiments: Datasets.
  LabelMe: street and office scenes; ordered tag lists via the order in which labels were added; 5 classes; 56 unique taggers; 23 tags/image; Dalal & Triggs's HOG detector.
  PASCAL VOC 2007: Flickr images; tag lists obtained on Mechanical Turk; 20 classes; 758 unique taggers; 5.5 tags/image; Felzenszwalb et al.'s LSVM detector.

  25. Experiments. We evaluate detection speed and detection accuracy, comparing the raw detector (HOG, LSVM) against the raw detector + our tag features. For reference, we also show results using Gist [Torralba 2003] as context.

  26. PASCAL: Performance evaluation. We search fewer windows to achieve the same detection rate: a naïve sliding window search scans 70% of the candidate windows to reach what we reach by searching only 30%. We also know which detection hypotheses to trust most.

  27–28. PASCAL: Accuracy vs. Gist per class. (Figure: per-class accuracy comparison.)

  29. PASCAL: Example detections, LSVM alone vs. LSVM+Tags (Ours). (Figure: detections on images tagged with, e.g., bottle, person, table, chair, mirror, tablecloth, bowl, shelf, painting, food, lamp, dog, sofa; and car, door, gear, steering wheel, seat, camera, license plate, building.)

  30. PASCAL: Example detections, LSVM+Tags (Ours) vs. LSVM alone. (Figure: detections on images tagged with, e.g., dog, floor, hairclip, person, ground, bench, scarf, horse, tree, house, building, hurdle, fence, microphone, light.)

  31. PASCAL: Example failure cases, LSVM+Tags (Ours) vs. LSVM alone. (Figure: failures on images tagged with, e.g., bottle, glass, wine, table; aeroplane, sky, building, shadow; person, pole, sidewalk, grass, road; dog, clothes, rope, plant, ground, string, wall.)

  32. Results: Observations. • Often our implicit features predict scale well for indoor objects and position well for outdoor objects. • Gist is usually better for y position, while our tags are generally stronger for scale, so visual and tag context are complementary. • The method needs to have learned about the target objects in a variety of examples with different contexts.

  33. Summary. • We want to learn what is implied (beyond which objects are present) by how a human provides tags for an image. • Our approach translates existing insights about human viewing behavior (attention, importance, gaze, etc.) into enhanced object detection. • The novel tag cues enable an effective localization prior. • We obtain significant gains with state-of-the-art detectors on two datasets.

  34. Future work. • Joint multi-object detection. • From tags to natural language sentences. • Image retrieval applications.
