Reading Between The Lines: Object Localization Using Implicit Cues from Image Tags


Presentation Transcript


  1. Reading Between The Lines: Object Localization Using Implicit Cues from Image Tags • Sung Ju Hwang and Kristen Grauman, University of Texas at Austin • Presented by Jingnan Li and Ievgeniia Gutenko

  2. Boy, Dog, Grass, Blue, Sky, Puppy, River, Stream, Sun, Colorado, Nikon • Baby, Infant, Kid, Child, Headphones, Red, Cute, Laughing

  3. Weakly labeled images • Lamp, Chair, Painting • Bicycle, Person • Baby, Table, Chair • Table, Lamp, Chair

  4. Object detection approaches • Prioritize search windows within the image based on a distribution learned from the tags (for speed). • Combine models based on both tags and image appearance (for accuracy).

  5. Motivation • Idea: what can be predicted about an image, before even looking at it, from its tags alone? • Both sets of tags suggest that a mug appears in the image, but since a tag list reflects what "catches the eye" first, the tags themselves can narrow the area the object detector has to search.

  6. Implicit Tag Feature Definitions • What implicit features can be obtained from tags? • Relative prominence of each object based on the order in the list. • Scale cues implied by unnamed objects. • The rough layout and proximity between objects based on the sequence in which tags are given.

  7. Implicit Tag Feature Definitions • Word presence and absence: a bag-of-words representation W = [w_1, …, w_N], where w_i denotes the number of times tag word i occurs in the image's associated keyword list, for a vocabulary of N total possible words. • For most tag lists this vector consists of only binary entries saying whether each tag has been named or not.
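A minimal sketch of how the word-presence feature might be computed (the toy vocabulary and function name are illustrative, not from the paper):

```python
# Build the word-presence feature W = [w_1, ..., w_N] over a fixed vocabulary.
import numpy as np

def word_presence(tags, vocab):
    """Count how many times each vocabulary word appears in the tag list.

    Typical tag lists name each word at most once, so the resulting
    vector is effectively binary.
    """
    index = {word: i for i, word in enumerate(vocab)}
    w = np.zeros(len(vocab))
    for tag in tags:
        if tag in index:
            w[index[tag]] += 1
    return w

vocab = ["baby", "chair", "lamp", "mug", "table"]        # toy vocabulary
print(word_presence(["table", "lamp", "chair"], vocab))  # [0. 1. 1. 0. 1.]
```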

  8. Implicit Tag Feature Definitions • Tag rank: the relative prominence of each object, since certain things will be named before others. • The feature R = [r_1, …, r_N] stores, for each word in the vocabulary, the percentile of its rank in this image's tag list relative to the ranks observed for that word in the training data. • Some objects have context-independent "noticeability", such as baby or fire truck, and are often named first regardless of their scale or position.
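A hedged sketch of the tag-rank feature, assuming the percentile is taken against the positions each word occupied in the training tag lists; treating absent words as 0 is an illustrative choice, not a detail confirmed by the slides:

```python
# Percentile-rank feature: for each vocabulary word, where does its position
# in this image's tag list fall among its positions in the training lists?
from bisect import bisect_right

def fit_rank_tables(training_tag_lists, vocab):
    """Collect, per vocabulary word, the sorted list of positions (ranks)
    at which it was named across the training tag lists."""
    tables = {word: [] for word in vocab}
    for tags in training_tag_lists:
        for position, tag in enumerate(tags, start=1):
            if tag in tables:
                tables[tag].append(position)
    return {word: sorted(ranks) for word, ranks in tables.items()}

def tag_rank_feature(tags, vocab, tables):
    positions = {tag: i for i, tag in enumerate(tags, start=1)}
    r = []
    for word in vocab:
        ranks = tables[word]
        if word in positions and ranks:
            # fraction of training occurrences named no later than here
            r.append(bisect_right(ranks, positions[word]) / len(ranks))
        else:
            r.append(0.0)  # absent word: no rank evidence
    return r

training = [["baby", "chair"], ["chair", "lamp", "baby"]]
vocab = ["baby", "chair", "lamp"]
tables = fit_rank_tables(training, vocab)
print(tag_rank_feature(["chair", "baby"], vocab, tables))  # [0.5, 0.5, 0.0]
```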

  9. Implicit Tag Feature Definitions • Mutual tag proximity: a tagger will name prominent objects first, then move his or her eyes to other objects nearby. • p_{i,j} denotes the (signed) rank difference between tag words i and j for the given image; the entry is 0 when the pair is not present.
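An illustrative sketch of the pairwise proximity feature; a real implementation would flatten the pairs into a fixed-length vector, but a dictionary keeps the toy example readable:

```python
# Signed rank difference p_{i,j} for each vocabulary pair; 0 if either
# word is absent from the image's tag list.
def mutual_proximity(tags, vocab):
    positions = {tag: i for i, tag in enumerate(tags, start=1)}
    p = {}
    for a in vocab:
        for b in vocab:
            if a < b:  # each unordered pair once
                if a in positions and b in positions:
                    p[(a, b)] = positions[a] - positions[b]
                else:
                    p[(a, b)] = 0
    return p

print(mutual_proximity(["table", "lamp", "chair"], ["chair", "lamp", "table"]))
# {('chair', 'lamp'): 1, ('chair', 'table'): 2, ('lamp', 'table'): 1}
```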

  10. Modeling the localization distributions • Relate the defined tag-based features to object detection (or a combination of detectors). • Model the conditional probability density that a window contains the object of interest, given only the image tags: P(X | T), where X = (x, y, s) are the window's position and scale, T is the tag-based feature, and the density is learned per target object category.

  11. Modeling the localization distributions • Use a mixture of Gaussians model: P(X | T) = Σ_k α_k(T) · N(X; μ_k(T), σ_k(T)), with the mixture parameters predicted by a trained Mixture Density Network (MDN). • Training: learn the MDN from tagged images with ground-truth bounding boxes. • Testing: for a novel image with no bounding boxes (e.g. tags Computer, Bicycle, Chair), the MDN provides the mixture model representing the most likely locations for the target object.
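A small sketch of evaluating such a mixture for a candidate window X = (x, y, s), assuming isotropic components; the parameter values below are made up, standing in for what a trained MDN would emit from the tags:

```python
# Evaluate P(X | T) for one window given MDN-predicted mixture parameters.
import numpy as np

def mixture_density(X, alphas, mus, sigmas):
    """alphas: (K,) mixing weights; mus: (K, 3) means over (x, y, s);
    sigmas: (K,) per-component standard deviations (isotropic)."""
    X = np.asarray(X, dtype=float)
    density = 0.0
    for alpha, mu, sigma in zip(alphas, mus, sigmas):
        d = X - mu
        norm = (2 * np.pi * sigma ** 2) ** (-len(X) / 2)
        density += alpha * norm * np.exp(-(d @ d) / (2 * sigma ** 2))
    return density

# Toy example with two components (fabricated values).
alphas = np.array([0.7, 0.3])
mus = np.array([[0.4, 0.6, 0.2], [0.8, 0.5, 0.1]])
sigmas = np.array([0.1, 0.05])
print(mixture_density([0.45, 0.58, 0.2], alphas, mus, sigmas))
```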

  12. The top 30 most likely places for a car, sampled according to the distribution modeled from the images' tags alone.

  13. Modulating or priming the detector • Use P(X | T) from the previous step and: • Combine it with the predictions of an object detector based on appearance A. Appearance cues: a HOG detector, or a part-based detector (deformable part model). • Use the tag-based model to rank sub-windows and run the detector on the most probable locations only ("priming"). • The decision value of the detector is mapped to a probability (e.g. with a sigmoid), as sketched below.
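A minimal sketch of such a sigmoid mapping (Platt-style scaling); the parameters a and b are fabricated here and would normally be fit on held-out detection scores:

```python
# Map a raw detector decision value to a probability in (0, 1).
import math

def score_to_probability(decision_value, a=-1.5, b=0.0):
    return 1.0 / (1.0 + math.exp(a * decision_value + b))

print(score_to_probability(2.0))   # high score -> probability near 1
print(score_to_probability(-1.0))  # low score  -> probability near 0
```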

  14. Modulating the detector • Balance appearance- and tag-based predictions with a weighted combination of all tag cues (word presence, rank, and proximity) and the detector output. • Learn the weights w using detection scores for true detections and a number of randomly sampled windows from the background. • A Gist descriptor can be added to compare against global visual scene context. • Goal: improve accuracy.
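A hedged sketch of learning the combination weights, treating the per-cue probabilities as features of a logistic regression over true detections and background windows; the feature values and the use of scikit-learn are illustrative assumptions, not the authors' implementation:

```python
# Learn per-cue weights w from labeled windows, then score new windows.
import numpy as np
from sklearn.linear_model import LogisticRegression

# columns: P(X|A), P(X|words), P(X|rank), P(X|proximity)  (fabricated)
scores = np.array([
    [0.9, 0.8, 0.7, 0.6],   # true detection
    [0.8, 0.9, 0.6, 0.7],   # true detection
    [0.2, 0.3, 0.4, 0.1],   # random background window
    [0.1, 0.2, 0.3, 0.2],   # random background window
])
labels = np.array([1, 1, 0, 0])

model = LogisticRegression().fit(scores, labels)
w = model.coef_[0]                            # learned per-cue weights
combined = model.predict_proba(scores)[:, 1]  # modulated window scores
print(w, combined)
```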

  15. Priming the detector • Prioritize the search windows according to P(X | T). • Assume the object is present, so only the localization parameters (x, y, s) have to be estimated. • Stop the search when a confident detection is found (probability > 0.5). • Goal: improve efficiency.
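A sketch of the priming loop under these assumptions; tag_prior and detector are placeholder callables standing in for the MDN density and the appearance detector:

```python
# Visit candidate windows in order of the tag-based prior; stop early
# at the first confident detection.
def primed_search(windows, tag_prior, detector, threshold=0.5):
    """windows: iterable of (x, y, s) candidates."""
    ranked = sorted(windows, key=tag_prior, reverse=True)
    for window in ranked:
        p = detector(window)      # probability the window has the object
        if p > threshold:
            return window, p      # confident detection: stop early
    return None, 0.0              # object assumed present but not found
```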

  16. Results • Datasets: • LabelMe: use the HOG detector. • PASCAL VOC: use the part-based detector. • Note: the last three columns of the dataset table show the ranges of positions/scales present in the images, averaged per class, as a percentage of image size.

  17. LabelMe Dataset • Priming object search: increasing speed. For a detection rate of 0.6, the proposed method considers only a third of the windows scanned by the sliding window approach. • Modulating the detector: increasing accuracy. The proposed features make noticeable improvements in accuracy over the raw detector.

  18. Example detections on LabelMe • Each image shows the best detection found. • Scores denote overlap ratio with ground truth. • The detectors modulated according to the visual or tag-based context are more accurate.

  19. PASCAL Dataset • Priming object search: increasing speed. Adopt the Latent SVM (LSVM) part-based windowed detector, which is faster here than the HOG detector was on LabelMe. • Modulating the detector: increasing accuracy. Augmenting the LSVM detector with the tag features noticeably improves accuracy, increasing the average precision by 9.2% overall.

  20. Example detections on PASCAL VOC • Red dotted boxes denote the most confident detections according to the raw detector (LSVM). • Green solid boxes denote the most confident detections when modulated by the proposed method (LSVM + tags). • The first two rows show good results; the third row shows failure cases.

  21. Conclusions • A novel approach to using the information "between the lines" of image tags. • Utilizing this implicit tag information makes object search both faster and more accurate. • The method complements, and can even exceed, the performance of methods using visual cues alone. • Shows potential for learning the tendencies of real taggers.

  22. Thank you!
