
Visual Attention based Region of Interest Coding for Video-telephony Applications

Nicolas Tsapatsoulis, Computer Science Dept., University of Cyprus. Outline: Aim, Overview, Visual Attention, The proposed algorithm, Combination of conspicuity maps, Experimental Results, Conclusions.


Presentation Transcript


1. Visual Attention based Region of Interest Coding for Video-telephony Applications
Nicolas Tsapatsoulis, Computer Science Dept., University of Cyprus

2. Aim of this study
• Develop an algorithm for Region of Interest (ROI) estimation based on visual attention.
• Visual attention: the field studying the behavior of humans when observing a scene.
• Visually important areas are expected to be the first areas humans fixate on; such areas are selected as ROIs.
• ROIs are encoded with higher accuracy than non-ROIs.
• Possible application: video telephony, where low bit rates are required while visual quality needs to be preserved.

3. Overview
• ROI areas are computed based on a saliency map.
• The saliency map combines intensity, orientation, color, and skin (face) conspicuity maps.
• All conspicuity maps are constructed based on the center-surround principle: visually important regions are those that stand out from their surround in terms of intensity, orientation, and color.
• The skin map corresponds to the a priori knowledge that faces are common in video-telephony applications and that humans, either implicitly or explicitly, fixate on such areas.

4. Overview (II)
• Stand-out areas are computed at various scales using a multiresolution approach, so that both small and large objects can stand out from their surround.
• The conspicuity maps are combined into a final saliency map using a sigmoid function.
• The contributions of the various channels (intensity, orientation, color, skin) are summed, so that areas that moderately stand out from their surround in several channels are indicated.
• Areas that highly stand out from their surround in a single channel can dominate (saturate) the combined aggregation, preserving their importance.
• Experiments: non-ROI areas are smoothed and passed to the encoder, as sketched below. This results in better intra-frame encoding (more concentrated DCT coefficients) and better prediction (inter-frame encoding).
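A minimal sketch of this pre-encoding step, assuming an OpenCV/NumPy pipeline: everything outside the ROI mask is low-pass filtered (blurred) so that a standard encoder spends fewer bits on it. Function and parameter names are illustrative, not taken from the original work.

```python
import cv2
import numpy as np

def smooth_non_roi(frame_bgr: np.ndarray, roi_mask: np.ndarray,
                   ksize: int = 15) -> np.ndarray:
    """Low-pass filter non-ROI pixels; keep ROI pixels untouched.

    frame_bgr : H x W x 3 uint8 frame
    roi_mask  : H x W mask, non-zero inside the ROI
    """
    blurred = cv2.GaussianBlur(frame_bgr, (ksize, ksize), 0)
    mask3 = (roi_mask > 0)[..., None]           # broadcast mask over the color channels
    return np.where(mask3, frame_bgr, blurred)  # ROI kept sharp, the rest smoothed
```

The smoothed frame is then handed to the encoder unchanged; the encoder itself needs no ROI awareness.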

5. Visual Attention Models
• Feature Integration Theory (FIT) by Treisman et al.: visual features are registered early, automatically, and in parallel along a number of separable dimensions (e.g. intensity, color, orientation, size, shape).
• FIT has been the basis of several visual attention algorithms and computational models developed over the last two decades.
• Saliency-based model of Itti & Koch: low-level vision features (color channels tuned to red, green, blue and yellow hues, orientation, and brightness) are extracted from the original color image at several spatial scales. The different spatial scales are produced using Gaussian pyramids, which consist of progressively low-pass filtering and sub-sampling the input image.
• Each feature is computed in a center-surround structure akin to visual receptive fields.

6. Model of Itti & Koch
• RGB color model.
• Orientation is computed by filtering at four directions and summing the results.
• The pyramid is produced by Gaussian low-pass filtering and subsampling.
• Center-surround: point-by-point differences between finer and coarser approximations, the latter being interpolated first (see the sketch after this list).
• The feature maps are normalized and added to create the saliency map.
• A Winner-Take-All (WTA) architecture models changes of fixation points.
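A minimal sketch of the center-surround step, assuming a Gaussian pyramid built with OpenCV and a single feature channel (e.g. intensity): the coarser "surround" level is interpolated back to the finer "center" resolution, the absolute difference is taken, and the per-pair maps are normalized and added. The scale pairs and the simple peak normalization are simplifications, not the exact choices of Itti & Koch.

```python
import cv2
import numpy as np

def gaussian_pyramid(img: np.ndarray, levels: int) -> list:
    """Progressively low-pass filter and subsample a single feature channel."""
    pyr = [img.astype(np.float32)]
    for _ in range(levels - 1):
        pyr.append(cv2.pyrDown(pyr[-1]))
    return pyr

def center_surround(pyr: list, center: int, surround: int) -> np.ndarray:
    """|finer level - interpolated coarser level|, at the center's resolution."""
    h, w = pyr[center].shape[:2]
    coarse = cv2.resize(pyr[surround], (w, h), interpolation=cv2.INTER_LINEAR)
    return np.abs(pyr[center] - coarse)

def conspicuity(img: np.ndarray, pairs=((2, 5), (3, 6), (4, 7))) -> np.ndarray:
    """Normalize each center-surround map and add them at a common scale."""
    pyr = gaussian_pyramid(img, levels=8)
    h, w = pyr[2].shape[:2]
    acc = np.zeros((h, w), np.float32)
    for c, s in pairs:
        m = cv2.resize(center_surround(pyr, c, s), (w, h))
        acc += m / (m.max() + 1e-6)  # crude stand-in for Itti's normalization operator
    return acc
```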

7. The proposed algorithm
• Based on the Itti & Koch model.
• Addition of a skin branch to model prior knowledge (the existence of faces in video-telephony applications).
• Wavelet-based implementation of the pyramid.
• YCrCb color model, to stay consistent with skin detection (skin color can be modeled by a small area in the Cr-Cb plane; the NTSC analog TV broadcasting system makes use of this property).
• Orientation is computed as across-scale differences in the detail bands (V, H, D).
• The conspicuity maps are combined through a sigmoid function to create the final saliency map.
• Note: the assumption that a final saliency map is created in the human brain has not been proven and remains a controversial issue among scientists.

8. The proposed algorithm (II)
• Decomposition of the Y, Cr, Cb color channels using Daubechies wavelets with filter coefficients of length 4, as sketched below.
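A brief sketch of this decomposition using PyWavelets; 'db2' is the Daubechies wavelet with four filter coefficients, matching the length-4 filters mentioned above. The helper name and the number of decomposition levels are assumptions.

```python
import numpy as np
import pywt

def wavelet_pyramid(channel: np.ndarray, levels: int = 4):
    """Multilevel 2-D DWT of one color channel with Daubechies length-4 filters.

    Returns [cA_n, (cH_n, cV_n, cD_n), ..., (cH_1, cV_1, cD_1)]:
    the approximation band plus horizontal/vertical/diagonal detail bands per level.
    """
    return pywt.wavedec2(channel.astype(np.float32), wavelet='db2', level=levels)

# Usage sketch: decompose each channel of a YCrCb frame independently, e.g.
# y, cr, cb = cv2.split(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2YCrCb))
# pyr_y, pyr_cr, pyr_cb = (wavelet_pyramid(c) for c in (y, cr, cb))
```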

9. The proposed algorithm (III)
• Center-surround differences computed at scale j for the intensity, color, and orientation channels (see the sketch below).
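The equations on this slide did not survive the transcript. A plausible form of the scale-j center-surround maps, written to match the per-channel descriptions on the following slides (intensity from the Y approximation band, color from the Cr/Cb approximation bands, orientation from the V/H/D detail bands), is given below; the notation is assumed, not taken from the original.

```latex
% Assumed notation: A_j(.) is the approximation band and V_j, H_j, D_j the
% detail bands at scale j; INTERP upsamples a coarser band to scale j.
I_j = \bigl| A_j(Y) - \mathrm{INTERP}\bigl(A_{j+1}(Y)\bigr) \bigr|
C_j = \bigl| A_j(C_r) - \mathrm{INTERP}\bigl(A_{j+1}(C_r)\bigr) \bigr|
    + \bigl| A_j(C_b) - \mathrm{INTERP}\bigl(A_{j+1}(C_b)\bigr) \bigr|
O_j = \sum_{d \in \{V, H, D\}} \bigl| d_j(Y) - \mathrm{INTERP}\bigl(d_{j+1}(Y)\bigr) \bigr|
```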

10. The proposed algorithm (IV)
• Conspicuity maps: interpolate I_j, O_j, C_j at the finest scale (j = 0) and add the center-surround differences over all scales j.
• Three conspicuity maps: intensity (I), color (C), orientation (O), plus the skin map (F).
• Maximum depth of analysis J_max.

11. Face Map
• Skin probability is computed at various scales.
• A 2D Gaussian probability density function models skin; a pseudo-probability is computed from the Mahalanobis distance.
• A face is modeled as a textured skin area.
• Figures: top left, original frame; top right, skin map; bottom left, multiscale texture map (range filtering at various scales); bottom right, face map created by multiplying the texture and skin maps.
• A sketch of the skin and face maps follows.
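A minimal sketch of the skin and face maps described above: a 2-D Gaussian model in the Cr-Cb plane is turned into a pseudo-probability via the Mahalanobis distance, then multiplied by a texture map. The mean and covariance values below are illustrative placeholders, not those of the original model.

```python
import numpy as np

# Illustrative skin model in the Cr-Cb plane; these numbers are placeholders.
SKIN_MEAN = np.array([150.0, 115.0])          # (Cr, Cb)
SKIN_COV = np.array([[60.0, -20.0],
                     [-20.0, 40.0]])
SKIN_COV_INV = np.linalg.inv(SKIN_COV)

def skin_map(cr: np.ndarray, cb: np.ndarray) -> np.ndarray:
    """Per-pixel skin pseudo-probability from the Mahalanobis distance."""
    d = np.stack([cr, cb], axis=-1).astype(np.float32) - SKIN_MEAN
    maha2 = np.einsum('...i,ij,...j->...', d, SKIN_COV_INV, d)  # squared distance
    return np.exp(-0.5 * maha2)                                 # in (0, 1]

def face_map(skin: np.ndarray, texture: np.ndarray) -> np.ndarray:
    """Face map = textured skin: element-wise product of the two maps."""
    return skin * texture
```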

12. Orientation Map
• Across-scale differences of the detail bands in the illumination (Y) channel:
• V = vertical detail: low-pass filtering of rows, high-pass filtering of columns.
• H = horizontal detail: high-pass filtering of rows, low-pass filtering of columns.
• D = diagonal detail: high-pass filtering of rows, high-pass filtering of columns.
• Figures: left, original frame; center, orientation map; right, intensity map (not enough, on its own, to accurately identify areas that stand out from their surround due to orientation).
• A sketch of this computation follows.
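A minimal sketch of the orientation map, assuming the wavelet decomposition returned by pywt.wavedec2 (as in the earlier sketch): across-scale absolute differences of the H/V/D detail bands of the Y channel are brought to the finest detail resolution and summed. The pairing of adjacent scales is an assumption.

```python
import cv2
import numpy as np

def orientation_map(coeffs) -> np.ndarray:
    """Sum across-scale differences of the H/V/D detail bands of the Y channel.

    `coeffs` is the output of pywt.wavedec2:
    [cA_n, (cH_n, cV_n, cD_n), ..., (cH_1, cV_1, cD_1)].
    """
    details = coeffs[1:]                 # detail tuples, coarsest to finest
    h, w = details[-1][0].shape          # finest detail-band resolution
    acc = np.zeros((h, w), np.float32)
    for coarse, fine in zip(details[:-1], details[1:]):   # adjacent scales
        for band_c, band_f in zip(coarse, fine):          # H, V, D bands
            up = cv2.resize(band_c.astype(np.float32), (w, h))
            fn = cv2.resize(band_f.astype(np.float32), (w, h))
            acc += np.abs(fn - up)
    return acc
```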

13. Intensity Map
• Across-scale differences of the approximation band in the illumination (Y) channel.
• In the figures below, the eyes of the newscaster are small areas that stand out from their surround.
• The blouse and the channel's logo are larger areas that stand out from their surround.
• The whole head of the newscaster is a large area standing out from its surround due to intensity.

14. Color Map
• Across-scale differences of the approximation bands in the chromaticity channels (Cr, Cb), added together.
• The channel's logo and the newscaster's hair are the areas with the most prominent difference from their surround.

15. Combination of conspicuity maps
• The individual conspicuity maps (I, O, C, F) are combined into the final saliency map (S) through a sigmoid function.
• The ROI is computed by thresholding the saliency map (using Otsu's method) and filling possible holes in the resulting mask.
• Non-ROI areas are smoothed by low-pass filtering and the frames are then encoded as usual (see figure to the right). A sketch of this combination and thresholding step follows.
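A minimal sketch of this step. The exact sigmoid and its parameters are not reproduced in the transcript, so a generic logistic squashing over the summed, normalized conspicuity maps is used here, followed by Otsu thresholding and hole filling; `gain` and `bias` are assumed values.

```python
import numpy as np
from scipy import ndimage
from skimage.filters import threshold_otsu

def saliency_map(I, O, C, F, gain: float = 4.0, bias: float = 0.5) -> np.ndarray:
    """Combine conspicuity maps with a logistic (sigmoid) squashing.

    Moderate responses in several channels add up, while a very strong response
    in a single channel saturates the sum; `gain` and `bias` are assumptions,
    not the parameters of the original sigmoid.
    """
    def norm(m):
        m = m.astype(np.float32)
        return m / (m.max() + 1e-6)
    x = norm(I) + norm(O) + norm(C) + norm(F)
    return 1.0 / (1.0 + np.exp(-gain * (x - bias)))

def roi_mask(S: np.ndarray) -> np.ndarray:
    """Threshold the saliency map with Otsu's method and fill holes."""
    mask = S > threshold_otsu(S)
    return ndimage.binary_fill_holes(mask)
```

The resulting mask feeds the non-ROI smoothing step sketched after slide 4.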

16. Experimental Results
• Aims: check whether the deterioration in ROI-encoded videos is observable (visual trial tests) and compute the bit-rate gain.
• 10 video clips with varying content, both indoor and outdoor; humans are always present.
• 10 human observers: non-experts (students), 5 female and 5 male.
• 60 seconds to watch the video clips (ROI-based and standard MPEG-1 encoding) and select the best one.
• Each video clip was viewed twice (200 tests in total).

17. Content (selected frames): grandma, eye_witness, news_cast1, fashion

18. Visual trials: selections per video clip and average bit rate
• Video clips: eye_witness, fashion, grandma, justice, lecturer, news_cast1, news_cast2, night_interview, old_man, soldier

19. Bit-rate gain across the ten video sequences

20. Conclusions - Further work
• Visual attention based ROI estimation can be used to indicate regions that need to be encoded with higher accuracy. In this way:
• a significant bit-rate gain, compared to MPEG-1, can be achieved, while
• the areas identified as visually important by the VA algorithm are in conformance with those identified by the human subjects, as can be deduced from the visual trial tests, and
• VA-ROI based encoding leads to better compression of both intra-coded and inter-coded frames, though the gain for the former is higher.
• Further work includes:
• conducting experiments to test the efficiency of the proposed method in the MPEG-4 framework;
• examining the effect of incorporating priority encoding by varying the quality factor of the DCT quantization table across VA-ROI and non-ROI frame blocks.
