Reinforcement Learning Model for Visual Search Optimization

Computer Science Readings: Reinforcement Learning Presentation by: Arif OZGELEN

How do we perform visual search? • We look in the usual places where the item is likely to be. • If the item is small, we tend to move closer to the area being searched to improve our ability to detect it. • We look for properties of the target object that distinguish it from the rest of the search space, e.g. color, shape, size, etc.

A Reinforcement Learning Model of Selective Visual Attention, ACM 2001. Silviu Minut, Autonomous Agents Lab, Department of Computer Science, Michigan State University. Sridhar Mahadevan, Autonomous Agents Lab, Department of Computer Science, Michigan State University.

The Problem of Visual Search • Goal: To find small objects in a large, usually cluttered environment. • e.g. a pen on a desk. • Wide field-of-view images are preferable. • Identifying small objects requires high resolution images. • Together these requirements result in a very high dimensional input array.

Nature’s Method: Foveated Vision - I • Fovea: Anatomically defined as the central region of the retina with high density of receptive cells. • Density of receptive cells decreases exponentially from the fovea towards periphery.

Nature’s Method: Foveated Vision - II • Saccades: To make up for the loss of information incurred by the decrease in resolution in the periphery, eyes are re-oriented by rapid ballistic motions (up to 900°/s) called saccades. • Fixations: Periods between saccades during which the eyes remain relatively fixed, to process visual information and to select the next fixation point.

Foveated Vision: Eye Scan Patterns

Using Foveated Vision • Using foveal image processing reduces the dimension of the input data but in turn generates a sequential decision problem: • Choosing the next fixation point requires an efficient gaze control mechanism in order to direct the gaze to the most salient object.

Gaze Control- Salient Features • In order to solve the problem of gaze control, the next fixation point must be chosen based on low resolution image regions that do not fall on the fovea. • Saliency Map Theory (Koch and Ullman): a task-independent, bottom-up model of visual attention. • Itti and Koch: based on Saliency Map Theory, 3 types of feature maps (color map, edge map, intensity map) are fused together to form a saliency map. • Low resolution images alone are usually not sufficient for this decision problem.

Gaze Control- Control Mechanism Implementation • A high level mechanism is required to control low level reactive attention. • Tsotsos model – proposes selective tuning of visual processing via a hierarchical winner-take-all process. • Information should be integrated from one fixation to the next for a global understanding of the scene. • Model: top-down gaze control based on RL, combined with bottom-up reactive saliency map processing.

Problem Definition and General Approach - I • Given an object and an environment: • How to build a vision agent that learns where the object is likely to be found. • How to direct its gaze to the object. • Set of Landmarks {L0,L1,..,Ln} representing regions in the environment. A policy on this set directs the camera to the most probable region containing the target object.

Problem Definition and General Approach – II • The approach does not require high level feature detectors. • Policy learned through RL is based on actual images seen by the camera. • Once the direction has been selected the precise location of the next fixation point is determined by means of visual saliency. • Camera takes low resolution/wide-field of view images at discrete time intervals. Using these low resolution images the system tries to recognize the target object using a low resolution template.

Problem Definition and General Approach – III • Since reliable detection of a small object is difficult at low resolution, the system first tries to obtain candidate locations for the target object. • Foveated vision is simulated by zooming in and grabbing high resolution/narrow field-of-view images centered at the candidate locations, which are compared with a high resolution template of the target object.

Target Object and the Environment Color template of the target object (left). Environment (bottom).

Reinforcement Learning • The agent may or may not know a priori the transition probabilities and the reward. If it does know them, dynamic programming techniques can be used to compute an optimal policy.
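When the transition probabilities and rewards are known, value iteration is one dynamic programming technique that computes an optimal policy. The sketch below is illustrative only: the two-state toy problem, the discount `gamma`, and the tolerance `tol` are assumptions and are not taken from the paper.

```python
def value_iteration(transitions, rewards, gamma=0.9, tol=1e-8):
    """Compute optimal state values and a greedy policy for a known MDP.

    transitions[s][a] is a list of (probability, next_state) pairs and
    rewards[s][a] is the immediate reward for taking action a in state s.
    """
    n_states = len(transitions)

    def action_value(s, a, values):
        expected_next = sum(p * values[s2] for p, s2 in transitions[s][a])
        return rewards[s][a] + gamma * expected_next

    values = [0.0] * n_states
    while True:
        # Bellman optimality backup for every state.
        new_values = [
            max(action_value(s, a, values) for a in range(len(transitions[s])))
            for s in range(n_states)
        ]
        change = max(abs(n - v) for n, v in zip(new_values, values))
        values = new_values
        if change < tol:
            break

    # Greedy policy with respect to the converged values.
    policy = [
        max(range(len(transitions[s])), key=lambda a, s=s: action_value(s, a, values))
        for s in range(n_states)
    ]
    return values, policy


# Hypothetical toy problem: in state 0 the agent can stay (action 0, reward 0)
# or move to the absorbing state 1 (action 1, reward 1).
transitions = [[[(1.0, 0)], [(1.0, 1)]], [[(1.0, 1)]]]
rewards = [[0.0, 1.0], [0.0]]
values, policy = value_iteration(transitions, rewards)
```

In this toy problem the greedy policy moves from state 0 to state 1 immediately, because the discounted value of staying is lower than the reward for moving.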

Q-Learning • In the visual search problem, the transition probabilities and the reward are not known to the agent. • A model-free Q-learning algorithm is used to find the optimal policy.
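The standard tabular Q-learning update can be sketched as follows. This is a minimal illustration, not the authors' implementation: the learning rate `alpha`, the discount `gamma`, the reward value, and the integer state and action indices are all assumptions.

```python
from collections import defaultdict


def q_update(Q, state, action, reward, next_state, n_actions, alpha=0.1, gamma=0.9):
    """Apply one tabular Q-learning update in place and return the new value."""
    # Value of the best action available from the next state.
    best_next = max(Q[(next_state, a)] for a in range(n_actions))
    td_target = reward + gamma * best_next
    Q[(state, action)] += alpha * (td_target - Q[(state, action)])
    return Q[(state, action)]


# Unvisited (state, action) pairs start at 0.
Q = defaultdict(float)

# One hypothetical step: action 3 in state 0 leads to state 1 with reward 1.
new_value = q_update(Q, state=0, action=3, reward=1.0, next_state=1, n_actions=9)
```

No transition model appears anywhere in the update: the agent only needs the observed next state and reward, which is what makes the method model-free.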

States – Objects in the Environment • Recorded scan patterns show that people fixate from object to object, so it is natural to define the states as the objects in the environment. • Paradox: Objects must be recognized as worth attending to before they are fixated on. However, an object cannot be recognized prior to the fixation, since it is perceived at low resolution.

States – Clusters of Images • States are defined as clusters of images representing the same region. • Each image is represented by color histograms on a reduced number of bins (48 colors for the lab environment). • Using histograms introduces perceptual aliasing, since two different images can have identical histograms. • To reduce aliasing, a separate histogram is computed for each quadrant of the image. This is expected to reduce aliasing since natural environments are sufficiently rich.
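The effect of per-quadrant histograms on aliasing can be sketched as follows. This is an illustrative sketch under stated assumptions: the image is assumed to be already quantized into color-bin indices and to be at least 2x2 pixels, and the function name `quadrant_histograms` is hypothetical.

```python
def quadrant_histograms(image, n_bins):
    """Return one normalized color histogram per image quadrant.

    `image` is a list of rows; each entry is a color-bin index in [0, n_bins).
    The image is assumed to be at least 2x2 so that no quadrant is empty.
    """
    height, width = len(image), len(image[0])
    mid_y, mid_x = height // 2, width // 2
    # (row start, row end, column start, column end) for each quadrant.
    bounds = [
        (0, mid_y, 0, mid_x), (0, mid_y, mid_x, width),
        (mid_y, height, 0, mid_x), (mid_y, height, mid_x, width),
    ]
    histograms = []
    for y0, y1, x0, x1 in bounds:
        counts = [0] * n_bins
        for y in range(y0, y1):
            for x in range(x0, x1):
                counts[image[y][x]] += 1
        total = sum(counts)
        histograms.append([c / total for c in counts])
    return histograms


# Two images with the same global histogram but mirrored layouts.
image_a = [[0, 0, 1, 1],
           [0, 0, 1, 1]]
image_b = [[1, 1, 0, 0],
           [1, 1, 0, 0]]
hists_a = quadrant_histograms(image_a, n_bins=2)
hists_b = quadrant_histograms(image_b, n_bins=2)
```

A single global histogram cannot tell `image_a` from `image_b`, while the per-quadrant histograms differ.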

Kullback Distance - I

Kullback Distance - II

Actions • Actions are defined as saccades to the most salient point. • {A1,..,A8} represent 8 directions. In addition, A0 represents a saccade to the most salient point in the whole image.
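One way to turn an action into a fixation point is sketched below. The slides do not say how the 8 directions partition the image, so the convention here is an assumption: action k (1 to 8) searches the 45-degree sector centered on direction (k - 1) * 45 degrees, measured counterclockwise from the image center, with action 1 pointing right. The function name `most_salient_point` is hypothetical.

```python
import math


def most_salient_point(saliency, action):
    """Return (row, col) of the most salient pixel allowed by `action`.

    Action 0 searches the whole map. Actions 1..8 search the 45-degree
    sector centered on direction (action - 1) * 45 degrees (assumed convention).
    """
    height, width = len(saliency), len(saliency[0])
    center_y, center_x = (height - 1) / 2, (width - 1) / 2
    best, best_value = None, -math.inf
    for y in range(height):
        for x in range(width):
            if action != 0:
                # Image rows grow downward, so flip y to get a conventional angle.
                angle = math.degrees(math.atan2(center_y - y, x - center_x)) % 360
                sector = int(((angle + 22.5) % 360) // 45) + 1
                if sector != action:
                    continue
            if saliency[y][x] > best_value:
                best, best_value = (y, x), saliency[y][x]
    return best


# Hypothetical 3x3 saliency map.
saliency = [[0.1, 0.2, 0.9],
            [0.3, 0.0, 0.4],
            [0.8, 0.1, 0.2]]
whole_image = most_salient_point(saliency, action=0)
to_the_right = most_salient_point(saliency, action=1)
lower_left = most_salient_point(saliency, action=6)
```

With this convention the learned policy only chooses a direction, and the saliency map supplies the precise fixation point within it.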

Reward • The agent receives a positive reward for a saccade that brings the object into the field of view. • The agent receives a negative reward if the object is not in the field of view after a saccade.

Within Fixation Processing • This is the stage in which the eyes fixate on a point, and the agent processes visual information and decides where to fixate next. • It comprises the computation of two components: • A set of two feature maps implementing low level visual attention, used to select the next fixation point. • A recognizer, used at low resolution for detection of candidate target objects and at high resolution for recognition of the target.

Histogram Intersection • A method used to match two images, I (search image) and M (model), by comparing their color histograms. • With this method it is difficult to find a threshold between similar and dissimilar images unless the model is pre-specified.
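The usual normalized form of histogram intersection (following Swain and Ballard) can be sketched as below. The slide does not give the formula, so the normalization by the model histogram and the example bin counts are assumptions.

```python
def histogram_intersection(hist_image, hist_model):
    """Return the fraction of the model histogram that is matched by the image.

    Both histograms must use the same color bins. The result lies in [0, 1].
    """
    overlap = sum(min(i, m) for i, m in zip(hist_image, hist_model))
    return overlap / sum(hist_model)


# Hypothetical pixel counts over 4 color bins.
hist_image = [4, 2, 0, 2]
hist_model = [2, 2, 2, 0]
score = histogram_intersection(hist_image, hist_model)
perfect = histogram_intersection(hist_model, hist_model)
```

The score is a continuous match value, so deciding that two images are "similar" still requires choosing a cut-off, which is the thresholding difficulty noted above.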

Histogram Back-projection • Given two images I and M, histogram back projection locates M in I. • Color histograms hI and hM are computed on the same number of color bins. • Operation requires one pass through I. For every pixel (x,y), B(x,y) = R(j) iff I(x,y) falls in bin j. • Always finds candidates.
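A minimal sketch of back-projection is shown below. The slide does not define R, so the sketch assumes the ratio histogram of Swain and Ballard, R(j) = min(hM(j) / hI(j), 1). The smoothing step that normally precedes peak detection is omitted, and the images are assumed to be already quantized into color-bin indices.

```python
def color_histogram(image, n_bins):
    """Count how many pixels of `image` fall in each color bin."""
    counts = [0] * n_bins
    for row in image:
        for bin_index in row:
            counts[bin_index] += 1
    return counts


def back_project(image, hist_image, hist_model):
    """Replace every pixel by the ratio-histogram value of its color bin."""
    # Assumed ratio histogram: R(j) = min(hM(j) / hI(j), 1), and 0 for empty bins.
    ratio = [
        min(m / i, 1.0) if i else 0.0
        for i, m in zip(hist_image, hist_model)
    ]
    # Single pass through the image: B(x, y) = R(j) where I(x, y) falls in bin j.
    return [[ratio[bin_index] for bin_index in row] for row in image]


# Hypothetical search image with one pixel of the model's color (bin 1).
image = [[0, 0, 0],
         [0, 1, 0],
         [0, 0, 2]]
hist_image = color_histogram(image, n_bins=3)
hist_model = [0, 1, 0]
backprojection = back_project(image, hist_image, hist_model)
```

The back-projected image always has a maximum somewhere, even when the model is absent from the scene, which is why the method always produces candidates that a recognizer must then confirm.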

Histogram Back-Projection Example

Symmetry Operator • In order to fixate on objects, a symmetry operator is used, since most man-made objects have vertical, horizontal or radial symmetry. • It first computes an edge map and then has each pair pi, pj of edge pixels vote for their midpoint by (9).

Symmetry Map

Model Description - I • Each low resolution image is processed by two main modules. • The top module (RL) learns a set of clusters consisting of images with similar color histograms. The clusters represent physical regions and are used as states in the Q-learning method. • The second module consists of low-level visual routines. Its purpose is to compute color and symmetry maps for saliency and to recognize the target object at both low and high resolution.

Model Description - II • Each low resolution image is processed by two main modules • Top module (RL) learns a set of clusters

Visual Search Agent Model

Algorithm - Initialization

Algorithm – If object found

Algorithm – If object not found

Results • The agent is trained to learn in which direction to direct its gaze in order to reach the region where the target object is most likely to be found. Training was run in trials of 400 epochs each. • Epoch: a sequence of at most 100 fixations. • Every 5th epoch was used for testing, in which the agent simply executed the learned policy. • The performance metric was the number of fixations. • Within a single trial, the starting point was the same in all test epochs.

Experimental Results - I

Experimental Results - II

Experimental Results - III

Sequence of Fixations

Experimental Results - IV

Experimental Results - V

Experimental Results - VI

Conclusion • Developed a model of selective attention for a visual search task, which is a combination of visual processing and control for attention. • Control is achieved by means of RL over a low level visual mechanism for selecting the next fixation. • Color and symmetry are used for selection of the next fixation, and it is not necessary to combine them in a unique saliency map. • The information is integrated from saccade to saccade.

Future Work • The goal is to extend this approach to a mobile robot. The problem becomes more challenging because the position, and consequently the appearance, of the object changes with the robot's position. A single template is not sufficient. • In this paper it is assumed that the environment is rich in color so that perceptual aliasing is not an issue. Extension to a mobile robot will inevitably lead to learning in inherently perceptually aliased environments.
