
Saliency-based Visual Attention



  1. Saliency-based Visual Attention 黃文中 2009-01-08

  2. Outline • Introduction • The Model • Results • Conclusion

  3. Outline • Introduction • The Model • Results • Conclusion

  4. If we are asked to find…

  5. Where to look? • Many visual processes are expensive • Humans don’t process the whole visual field • How do we decide what to process? • How can we use insights about this to make machine vision more efficient?

  6. Visual salience • Salience ~ visual prominence • Must be cheap to calculate • Related to features that we collect from very early stages of visual processing • Colour, orientation, intensity change and motion are all important indicators of salience

  7. Saliency Map • The Saliency Map is a topographically arranged map that represents visual saliency of a corresponding visual scene.

  8. Saliency map (Cont'd) • Two kinds of stimuli: • Bottom-up • Depends only on the instantaneous sensory input • Does not take into account the internal state of the organism • Top-down • Takes into account the internal state • Such as the goals the organism has at the time, personal history, experiences, etc.

  9. Outline • Introduction • The Model • Results • Conclusion

  10. A Model of Saliency-based Visual Attention

  11. Three main steps • Extraction • extract feature vectors at locations over the image plane • Activation • form an "activation map" (or maps) using the feature vectors • Normalization / Combination • normalize the activation map (or maps), followed by combination of the maps into a single map

  12. In detail… • Nine spatial scales are created using dyadic Gaussian pyramids. • Each feature is computed by a set of linear "center-surround" operations akin to visual receptive fields. • Normalization • Across-scale combination into three "conspicuity maps." • Linear combination to create the saliency map. • Winner-take-all

  13. Image pyramids • The original image is decomposed into sets of lowpass and bandpass components via Gaussian and Laplacian pyramids. • The Gaussian pyramid consists of lowpass-filtered (LPF) copies of the image. • The Laplacian pyramid consists of bandpass-filtered (BPF) copies (see the sketch below).
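
A minimal sketch of how the dyadic Gaussian and Laplacian pyramids could be built for a single-channel image using NumPy/SciPy. The nine-level count comes from the slides; the blur sigma and linear interpolation are illustrative assumptions, not the paper's exact filter.

```python
# Sketch: dyadic Gaussian (lowpass) and Laplacian (bandpass) pyramids.
# Assumes a 2-D float array (one channel); sigma=1.0 is an illustrative choice.
import numpy as np
from scipy.ndimage import gaussian_filter, zoom

def gaussian_pyramid(img, levels=9):
    """Each level is a lowpass-filtered, half-resolution copy of the previous one."""
    pyr = [img.astype(float)]
    for _ in range(levels - 1):
        lpf = gaussian_filter(pyr[-1], sigma=1.0)  # lowpass filter
        pyr.append(lpf[::2, ::2])                  # dyadic (factor-2) downsampling
    return pyr

def laplacian_pyramid(gauss_pyr):
    """Each level is the bandpass difference between a level and the upsampled next one."""
    lap = []
    for fine, coarse in zip(gauss_pyr[:-1], gauss_pyr[1:]):
        factors = np.array(fine.shape) / np.array(coarse.shape)
        lap.append(fine - zoom(coarse, factors, order=1))  # bandpass component
    lap.append(gauss_pyr[-1])                              # lowpass residual
    return lap
```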

  14. Image pyramids (Cont'd) [figure]

  15. Extraction of early visual features • Intensity image: $I = (r+g+b)/3$ • Color channels: $R = r - (g+b)/2$, $G = g - (r+b)/2$, $B = b - (r+g)/2$, $Y = (r+g)/2 - |r-g|/2 - b$ (negative values are set to zero) • Local orientation information: $O(\sigma,\theta)$ with $\theta \in \{0°, 45°, 90°, 135°\}$, obtained using oriented Gabor pyramids (a sketch of the intensity and color channels follows)
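
A sketch of the intensity and color-opponent channels as defined in the cited Itti et al. (1998) paper, which the slide's missing equation images showed. The input is assumed to be an H×W×3 float RGB array in [0, 1]; the Gabor orientation channels are omitted here.

```python
# Sketch: early visual feature channels (intensity and color opponency).
import numpy as np

def early_features(rgb):
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    I = (r + g + b) / 3.0                                     # intensity channel
    # Decouple hue from intensity: normalize r, g, b where I is non-negligible.
    norm = np.where(I > 0.1 * I.max(), I, np.inf)
    r, g, b = r / norm, g / norm, b / norm
    R = np.maximum(0.0, r - (g + b) / 2)                      # red
    G = np.maximum(0.0, g - (r + b) / 2)                      # green
    B = np.maximum(0.0, b - (r + g) / 2)                      # blue
    Y = np.maximum(0.0, (r + g) / 2 - np.abs(r - g) / 2 - b)  # yellow
    return I, R, G, B, Y
```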

  16. In detail… • Nine spatial scales are created using dyadic Gaussian pyramids. • Each feature is computed by a set of linear "center-surround" operations akin to visual receptive fields. • Normalization • Across-scale combination into three "conspicuity maps." • Linear combination to create the saliency map. • Winner-take-all

  17. Center-surround differences • The across-scale difference $\ominus$ is obtained by interpolation to the finer scale and point-by-point subtraction. • Intensity contrast: $I(c,s) = |I(c) \ominus I(s)|$ • Color double-opponent: $RG(c,s) = |(R(c)-G(c)) \ominus (G(s)-R(s))|$ and $BY(c,s) = |(B(c)-Y(c)) \ominus (Y(s)-B(s))|$ • Orientation feature maps: $O(c,s,\theta) = |O(c,\theta) \ominus O(s,\theta)|$, with center scales $c \in \{2,3,4\}$ and surround scales $s = c+\delta$, $\delta \in \{3,4\}$ (see the sketch below)
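
A sketch of one center-surround difference, assuming `pyr` is a list of pyramid levels as built in the earlier sketch: the surround (coarse) level is interpolated up to the center (fine) scale and subtracted point by point.

```python
# Sketch: center-surround difference between pyramid scales c (center) and s (surround).
import numpy as np
from scipy.ndimage import zoom

def center_surround(pyr, c, s):
    center = pyr[c]
    factors = np.array(center.shape) / np.array(pyr[s].shape)
    surround = zoom(pyr[s], factors, order=1)  # interpolate to the finer scale
    return np.abs(center - surround)           # point-by-point subtraction

# e.g. the six intensity-contrast maps, c in {2,3,4}, s = c + delta, delta in {3,4}:
# intensity_maps = [center_surround(I_pyr, c, c + d) for c in (2, 3, 4) for d in (3, 4)]
```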

  18. In detail… • Nine spatial scales are created using dyadic Gaussian pyramids. • Each feature is computed by a set of linear "center-surround" operations akin to visual receptive fields. • Normalization • Across-scale combination into three "conspicuity maps." • Linear combination to create the saliency map. • Winner-take-all

  19. Map normalization operator • The map normalization operator $\mathcal{N}(\cdot)$:

  20. Map normalization operator (Cont'd) • Normalizing the values in the map to a fixed range $[0..M]$, in order to eliminate modality-dependent amplitude differences • Finding the location of the map's global maximum $M$ and computing the average $\bar{m}$ of all its other local maxima • Globally multiplying the map by $(M - \bar{m})^2$ (see the sketch below).
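
A sketch of $\mathcal{N}(\cdot)$ following the three steps above. Detecting the "other local maxima" via a 3×3 maximum filter is an implementation assumption; the slides do not prescribe a neighborhood size.

```python
# Sketch: global non-linear map normalization N(.).
import numpy as np
from scipy.ndimage import maximum_filter

def normalize_map(m, M=1.0):
    m = m - m.min()
    if m.max() > 0:
        m = m * (M / m.max())                    # rescale to the fixed range [0, M]
    # Local maxima: pixels equal to the maximum of their 3x3 neighborhood.
    is_peak = (m == maximum_filter(m, size=3)) & (m > 0)
    peaks = m[is_peak]
    others = peaks[peaks < M]                    # all local maxima except the global one
    m_bar = others.mean() if others.size else 0.0
    return m * (M - m_bar) ** 2                  # boost maps with a single strong peak
```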

  21. Map normalization operator (Cont'd) • This method is called global non-linear normalization. • Pros: • Computationally very simple. • Easily allows real-time implementation because it is non-iterative. • Cons: • The strategy is not very biologically plausible, since global computations are used. • Not robust to noise, which can be stronger than the signal.

  22. Iterative localized interactions • Non-classical surround inhibition • Interactions within each individual feature map rather than between maps • Inhibition is strongest at a particular distance from the center and weakens at both shorter and longer distances. • The structure of the non-classical interactions can be coarsely modeled by a two-dimensional difference-of-Gaussians (DoG) connection pattern (sketched below).
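
A sketch of one iteration of these interactions: by linearity, convolving with a DoG equals the difference of two Gaussian filterings. The constants below are illustrative stand-ins, not the exact values from Itti & Koch (2000).

```python
# Sketch: one iteration of within-map DoG competition, rectified at zero.
import numpy as np
from scipy.ndimage import gaussian_filter

def dog_iteration(m, sigma_ex=2.0, sigma_inh=25.0, c_ex=0.5, c_inh=1.5, bias=0.02):
    excitation = c_ex * gaussian_filter(m, sigma_ex)    # narrow excitatory lobe
    inhibition = c_inh * gaussian_filter(m, sigma_inh)  # broad inhibitory surround
    return np.maximum(0.0, m + excitation - inhibition - bias)

# Repeating a few iterations suppresses maps with many similar peaks, while
# a map with one dominant peak survives:
# for _ in range(10): m = dog_iteration(m)
```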

  23. Iterative localized interactions (Cont'd)

  24. Iterative localized interactions (Cont'd)

  25. Iterative localized interactions (Cont'd)

  26. In detail… • Nine spatial scales are created using dyadic Gaussian pyramids. • Each feature is computed by a set of linear "center-surround" operations akin to visual receptive fields. • Normalization • Across-scale combination into three "conspicuity maps." • Linear combination to create the saliency map. • Winner-take-all

  27. Across-scale combination • Across-scale addition "$\oplus$" consists of reduction of each map to scale 4 and point-by-point addition: $\bar{I} = \oplus_{c=2}^{4} \oplus_{s=c+3}^{c+4} \mathcal{N}(I(c,s))$, and analogously $\bar{C}$ from the $RG$ and $BY$ maps and $\bar{O}$ from the four orientations (see the sketch below).
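
A sketch of the across-scale addition: every feature map is resampled to the resolution of scale 4 and the maps are summed point by point. `normalize_map` refers to the $\mathcal{N}(\cdot)$ sketch from slide 20.

```python
# Sketch: across-scale addition into a conspicuity map at scale 4.
import numpy as np
from scipy.ndimage import zoom

def across_scale_sum(maps, target_shape):
    total = np.zeros(target_shape)
    for m in maps:
        factors = np.array(target_shape) / np.array(m.shape)
        total += zoom(m, factors, order=1)  # reduce each map to scale 4
    return total

# e.g. the intensity conspicuity map:
# I_bar = across_scale_sum([normalize_map(m) for m in intensity_maps], shape_scale4)
```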

  28. In detail… • Nine spatial scales are created using dyadic Gaussian pyramids. • Each feature is computed by a set of linear "center-surround" operations akin to visual receptive fields. • Normalization • Across-scale combination into three "conspicuity maps." • Linear combination to create the saliency map. • Winner-take-all

  29. The Saliency Map (SM) • The three conspicuity maps are normalized and summed into the final input $S$ to the saliency map: $S = \frac{1}{3}\left(\mathcal{N}(\bar{I}) + \mathcal{N}(\bar{C}) + \mathcal{N}(\bar{O})\right)$ • The weight of each channel is tunable (a sketch follows).
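
A sketch of the final combination, reusing `normalize_map` from the slide-20 sketch; the equal 1/3 weights match the formula above and can be retuned per channel.

```python
# Sketch: normalize and sum the three conspicuity maps into the saliency input S.
def saliency(I_bar, C_bar, O_bar, weights=(1/3, 1/3, 1/3)):
    maps = (I_bar, C_bar, O_bar)
    return sum(w * normalize_map(m) for w, m in zip(weights, maps))
```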

  30. In detail… • Nine spatial scales are created using dyadic Gaussian pyramids. • Each feature is computed by a set of linear "center-surround" operations akin to visual receptive fields. • Normalization • Across-scale combination into three "conspicuity maps." • Linear combination to create the saliency map. • Winner-take-all

  31. Winner-take-all • At any given time, only one location is selected from the early representation and copied into the central representation.

  32. Winner-take-all (Cont'd) • The focus of attention (FOA) is shifted to the location of the winner neuron. • The global inhibition of the WTA is triggered and completely inhibits (resets) all WTA neurons. • Local inhibition is transiently activated in the SM, in an area with the size and new location of the FOA (a simplified loop is sketched below).
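
A deliberately simplified, non-neural sketch of this attend-and-inhibit loop: instead of simulating WTA integrate-and-fire neurons, it takes the global maximum of the saliency map and then zeroes a disc around it (inhibition of return). The fixation count and FOA radius are illustrative assumptions.

```python
# Sketch: winner-take-all selection with inhibition of return.
import numpy as np

def attend(saliency_map, n_fixations=5, foa_radius=20):
    s = saliency_map.copy()
    ys, xs = np.mgrid[:s.shape[0], :s.shape[1]]
    fixations = []
    for _ in range(n_fixations):
        y, x = np.unravel_index(np.argmax(s), s.shape)  # WTA: most salient location
        fixations.append((y, x))
        foa = (ys - y) ** 2 + (xs - x) ** 2 <= foa_radius ** 2
        s[foa] = 0.0                                    # local inhibition at the FOA
    return fixations
```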

  33. Outline • Introduction • The Model • Results • Conclusion

  34. Model performance on noisy versions of pop-out and conjunctive tasks

  35. Outline • Introduction • The Model • Results • Conclusion

  36. Conclusion • The paper proposes a conceptually simple computational model for saliency-driven focal visual attention. • The framework can consequently be easily tailored to arbitrary tasks through the implementation of dedicated feature maps.

  37. References • L. Itti, C. Koch, and E. Niebur, "A Model of Saliency-Based Visual Attention for Rapid Scene Analysis," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 20, No. 11, pp. 1254-1259, Nov. 1998. • L. Itti and C. Koch, "A Saliency-Based Search Mechanism for Overt and Covert Shifts of Visual Attention," Vision Research, Vol. 40, No. 10-12, pp. 1489-1506, May 2000. • H. Greenspan, S. Belongie, R. Goodman, P. Perona, S. Rakshit, and C.H. Anderson, "Overcomplete Steerable Pyramid Filters and Rotation Invariance," Proc. IEEE Computer Vision and Pattern Recognition, pp. 222-228, Seattle, Wash., June 1994.
