
Visual Attention based Region of Interest Coding for Video-telephony Applications

Nicolas Tsapatsoulis, Computer Science Dept., University of Cyprus. Outline: Aim, Overview, Visual Attention, The proposed algorithm, Combination of conspicuity maps, Experimental Results, Conclusions.


Presentation Transcript


1. Visual Attention based Region of Interest Coding for Video-telephony Applications
Nicolas Tsapatsoulis, Computer Science Dept., University of Cyprus

2. Aim of this study
• Develop an algorithm for Region of Interest (ROI) estimation based on visual attention.
• Visual attention: the field studying the behavior of humans when observing a scene.
• Visually important areas are expected to be the first areas humans fixate on; such areas are selected as ROIs.
• ROIs are encoded with higher accuracy than non-ROIs.
• Possible application: video telephony, where low bit rates are required while visual quality needs to be preserved.

3. Overview
• ROI areas are computed based on a saliency map.
• The saliency map combines intensity, orientation, color, and skin (face) conspicuity maps.
• All conspicuity maps are constructed based on the center-surround principle: visually important regions are those that stand out from their surround in terms of intensity, orientation, and color.
• The skin map corresponds to the a priori knowledge that faces are common in video-telephony applications and that humans, either implicitly or explicitly, fixate on such areas.

4. Overview (II)
• Stand-out areas are computed at various scales using a multiresolution approach, so that both small and large objects can stand out from their surround.
• The conspicuity maps are combined into a final saliency map using a sigmoid function.
• The contributions of the various channels (intensity, orientation, color, skin) are summed, so that areas that moderately stand out from their surround in several channels are indicated.
• Areas that highly stand out from their surround in a single channel can dominate (saturate) the combined aggregation, preserving their importance.
• Experiments: non-ROI areas are smoothed and passed to the encoder, as sketched below. This results in better intra-frame encoding (more concentrated DCT coefficients) and better prediction (inter-frame encoding).
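A minimal sketch of this pre-encoding step, assuming an OpenCV/NumPy pipeline: everything outside the ROI mask is low-pass filtered (blurred) so that a standard encoder spends fewer bits on it. Function and parameter names are illustrative, not taken from the original work.

```python
import cv2
import numpy as np

def smooth_non_roi(frame_bgr: np.ndarray, roi_mask: np.ndarray,
                   ksize: int = 15) -> np.ndarray:
    """Low-pass filter non-ROI pixels; keep ROI pixels untouched.

    frame_bgr : H x W x 3 uint8 frame
    roi_mask  : H x W mask, non-zero inside the ROI
    """
    blurred = cv2.GaussianBlur(frame_bgr, (ksize, ksize), 0)
    mask3 = (roi_mask > 0)[..., None]           # broadcast mask over the color channels
    return np.where(mask3, frame_bgr, blurred)  # ROI kept sharp, the rest smoothed
```

The smoothed frame is then handed to the encoder unchanged; the encoder itself needs no ROI awareness.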

5. Visual Attention Models
• Feature Integration Theory (FIT) by Treisman et al.: visual features are registered early, automatically, and in parallel along a number of separable dimensions (e.g. intensity, color, orientation, size, shape).
• FIT has been the basis of several visual attention algorithms and computational models developed over the last two decades.
• Saliency-based model of Itti & Koch: low-level vision features (color channels tuned to red, green, blue and yellow hues, orientation, and brightness) are extracted from the original color image at several spatial scales. The different spatial scales are produced using Gaussian pyramids, which consist of progressively low-pass filtering and sub-sampling the input image.
• Each feature is computed in a center-surround structure akin to visual receptive fields.

6. Model of Itti & Koch
• RGB color model.
• Orientation is computed by filtering at four directions and summing the results.
• The pyramid is produced by Gaussian low-pass filtering and subsampling.
• Center-surround: point-by-point differences between finer and coarser approximations, the latter being interpolated first (see the sketch after this list).
• The feature maps are normalized and added to create the saliency map.
• A Winner-Take-All (WTA) architecture models changes of fixation points.
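A minimal sketch of the center-surround step, assuming a Gaussian pyramid built with OpenCV and a single feature channel (e.g. intensity): the coarser "surround" level is interpolated back to the finer "center" resolution, the absolute difference is taken, and the per-pair maps are normalized and added. The scale pairs and the simple peak normalization are simplifications, not the exact choices of Itti & Koch.

```python
import cv2
import numpy as np

def gaussian_pyramid(img: np.ndarray, levels: int) -> list:
    """Progressively low-pass filter and subsample a single feature channel."""
    pyr = [img.astype(np.float32)]
    for _ in range(levels - 1):
        pyr.append(cv2.pyrDown(pyr[-1]))
    return pyr

def center_surround(pyr: list, center: int, surround: int) -> np.ndarray:
    """|finer level - interpolated coarser level|, at the center's resolution."""
    h, w = pyr[center].shape[:2]
    coarse = cv2.resize(pyr[surround], (w, h), interpolation=cv2.INTER_LINEAR)
    return np.abs(pyr[center] - coarse)

def conspicuity(img: np.ndarray, pairs=((2, 5), (3, 6), (4, 7))) -> np.ndarray:
    """Normalize each center-surround map and add them at a common scale."""
    pyr = gaussian_pyramid(img, levels=8)
    h, w = pyr[2].shape[:2]
    acc = np.zeros((h, w), np.float32)
    for c, s in pairs:
        m = cv2.resize(center_surround(pyr, c, s), (w, h))
        acc += m / (m.max() + 1e-6)  # crude stand-in for Itti's normalization operator
    return acc
```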

7. The proposed algorithm
• Based on the Itti & Koch model.
• Addition of a skin branch to model prior knowledge (the existence of faces in video-telephony applications).
• Wavelet-based implementation of the pyramid.
• YCrCb color model, to stay consistent with skin detection (skin color can be modeled by a small area in the Cr-Cb plane; the NTSC analog TV broadcasting system makes use of this property).
• Orientation is computed as across-scale differences in the detail bands (V, H, D).
• The conspicuity maps are combined through a sigmoid function to create the final saliency map.
• Note: the assumption that a final saliency map is created in the human brain has not been proven and remains a controversial issue among scientists.

8. The proposed algorithm (II)
• Decomposition of the Y, Cr, Cb color channels using Daubechies wavelets with filter coefficients of length 4, as sketched below.
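A brief sketch of this decomposition using PyWavelets; 'db2' is the Daubechies wavelet with four filter coefficients, matching the length-4 filters mentioned above. The helper name and the number of decomposition levels are assumptions.

```python
import numpy as np
import pywt

def wavelet_pyramid(channel: np.ndarray, levels: int = 4):
    """Multilevel 2-D DWT of one color channel with Daubechies length-4 filters.

    Returns [cA_n, (cH_n, cV_n, cD_n), ..., (cH_1, cV_1, cD_1)]:
    the approximation band plus horizontal/vertical/diagonal detail bands per level.
    """
    return pywt.wavedec2(channel.astype(np.float32), wavelet='db2', level=levels)

# Usage sketch: decompose each channel of a YCrCb frame independently, e.g.
# y, cr, cb = cv2.split(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2YCrCb))
# pyr_y, pyr_cr, pyr_cb = (wavelet_pyramid(c) for c in (y, cr, cb))
```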

9. The proposed algorithm (III)
• Center-surround differences computed at scale j for the intensity, color, and orientation channels (see the sketch below).
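The equations on this slide did not survive the transcript. A plausible form of the scale-j center-surround maps, written to match the per-channel descriptions on the following slides (intensity from the Y approximation band, color from the Cr/Cb approximation bands, orientation from the V/H/D detail bands), is given below; the notation is assumed, not taken from the original.

```latex
% Assumed notation: A_j(.) is the approximation band and V_j, H_j, D_j the
% detail bands at scale j; INTERP upsamples a coarser band to scale j.
I_j = \bigl| A_j(Y) - \mathrm{INTERP}\bigl(A_{j+1}(Y)\bigr) \bigr|
C_j = \bigl| A_j(C_r) - \mathrm{INTERP}\bigl(A_{j+1}(C_r)\bigr) \bigr|
    + \bigl| A_j(C_b) - \mathrm{INTERP}\bigl(A_{j+1}(C_b)\bigr) \bigr|
O_j = \sum_{d \in \{V, H, D\}} \bigl| d_j(Y) - \mathrm{INTERP}\bigl(d_{j+1}(Y)\bigr) \bigr|
```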

10. The proposed algorithm (IV)
• Conspicuity maps: interpolate I_j, O_j, C_j at the finest scale (j = 0) and add the center-surround differences over all scales j.
• Three conspicuity maps: intensity (I), color (C), orientation (O), plus the skin map (F).
• Maximum depth of analysis J_max.

11. Face Map
• Skin probability is computed at various scales.
• A 2D Gaussian probability density function models skin; a pseudo-probability is computed from the Mahalanobis distance.
• A face is modeled as a textured skin area.
• Figures: top left, original frame; top right, skin map; bottom left, multiscale texture map (range filtering at various scales); bottom right, face map created by multiplying the texture and skin maps.
• A sketch of the skin and face maps follows.
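A minimal sketch of the skin and face maps described above: a 2-D Gaussian model in the Cr-Cb plane is turned into a pseudo-probability via the Mahalanobis distance, then multiplied by a texture map. The mean and covariance values below are illustrative placeholders, not those of the original model.

```python
import numpy as np

# Illustrative skin model in the Cr-Cb plane; these numbers are placeholders.
SKIN_MEAN = np.array([150.0, 115.0])          # (Cr, Cb)
SKIN_COV = np.array([[60.0, -20.0],
                     [-20.0, 40.0]])
SKIN_COV_INV = np.linalg.inv(SKIN_COV)

def skin_map(cr: np.ndarray, cb: np.ndarray) -> np.ndarray:
    """Per-pixel skin pseudo-probability from the Mahalanobis distance."""
    d = np.stack([cr, cb], axis=-1).astype(np.float32) - SKIN_MEAN
    maha2 = np.einsum('...i,ij,...j->...', d, SKIN_COV_INV, d)  # squared distance
    return np.exp(-0.5 * maha2)                                 # in (0, 1]

def face_map(skin: np.ndarray, texture: np.ndarray) -> np.ndarray:
    """Face map = textured skin: element-wise product of the two maps."""
    return skin * texture
```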

12. Orientation Map
• Across-scale differences of the detail bands in the illumination (Y) channel:
• V = vertical detail: low-pass filtering of rows, high-pass filtering of columns.
• H = horizontal detail: high-pass filtering of rows, low-pass filtering of columns.
• D = diagonal detail: high-pass filtering of rows, high-pass filtering of columns.
• Figures: left, original frame; center, orientation map; right, intensity map (not enough, on its own, to accurately identify areas that stand out from their surround due to orientation).
• A sketch of this computation follows.
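A minimal sketch of the orientation map, assuming the wavelet decomposition returned by pywt.wavedec2 (as in the earlier sketch): across-scale absolute differences of the H/V/D detail bands of the Y channel are brought to the finest detail resolution and summed. The pairing of adjacent scales is an assumption.

```python
import cv2
import numpy as np

def orientation_map(coeffs) -> np.ndarray:
    """Sum across-scale differences of the H/V/D detail bands of the Y channel.

    `coeffs` is the output of pywt.wavedec2:
    [cA_n, (cH_n, cV_n, cD_n), ..., (cH_1, cV_1, cD_1)].
    """
    details = coeffs[1:]                 # detail tuples, coarsest to finest
    h, w = details[-1][0].shape          # finest detail-band resolution
    acc = np.zeros((h, w), np.float32)
    for coarse, fine in zip(details[:-1], details[1:]):   # adjacent scales
        for band_c, band_f in zip(coarse, fine):          # H, V, D bands
            up = cv2.resize(band_c.astype(np.float32), (w, h))
            fn = cv2.resize(band_f.astype(np.float32), (w, h))
            acc += np.abs(fn - up)
    return acc
```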

13. Intensity Map
• Across-scale differences of the approximation band in the illumination (Y) channel.
• In the figures below, the eyes of the newscaster are small areas that stand out from their surround.
• The blouse and the channel's logo are larger areas that stand out from their surround.
• The whole head of the newscaster is a large area standing out from its surround due to intensity.

14. Color Map
• Across-scale differences of the approximation bands in the chromaticity channels (Cr, Cb), added together.
• The channel's logo and the newscaster's hair are the areas with the most prominent difference from their surround.

15. Combination of conspicuity maps
• The individual conspicuity maps (I, O, C, F) are combined into the final saliency map (S) through a sigmoid function.
• The ROI is computed by thresholding the saliency map (using Otsu's method) and filling possible holes in the resulting mask.
• Non-ROI areas are smoothed by low-pass filtering and the frames are then encoded as usual (see figure to the right). A sketch of this combination and thresholding step follows.
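A minimal sketch of this step. The exact sigmoid and its parameters are not reproduced in the transcript, so a generic logistic squashing over the summed, normalized conspicuity maps is used here, followed by Otsu thresholding and hole filling; `gain` and `bias` are assumed values.

```python
import numpy as np
from scipy import ndimage
from skimage.filters import threshold_otsu

def saliency_map(I, O, C, F, gain: float = 4.0, bias: float = 0.5) -> np.ndarray:
    """Combine conspicuity maps with a logistic (sigmoid) squashing.

    Moderate responses in several channels add up, while a very strong response
    in a single channel saturates the sum; `gain` and `bias` are assumptions,
    not the parameters of the original sigmoid.
    """
    def norm(m):
        m = m.astype(np.float32)
        return m / (m.max() + 1e-6)
    x = norm(I) + norm(O) + norm(C) + norm(F)
    return 1.0 / (1.0 + np.exp(-gain * (x - bias)))

def roi_mask(S: np.ndarray) -> np.ndarray:
    """Threshold the saliency map with Otsu's method and fill holes."""
    mask = S > threshold_otsu(S)
    return ndimage.binary_fill_holes(mask)
```

The resulting mask feeds the non-ROI smoothing step sketched after slide 4.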

16. Experimental Results
• Aims: check whether the deterioration in ROI-encoded videos is observable (visual trial tests) and compute the bit-rate gain.
• 10 video clips with varying content, both indoor and outdoor; humans are always present.
• 10 human observers: non-experts (students), 5 female and 5 male.
• 60 seconds to watch the video clips (ROI-based and standard MPEG-1 encoding) and select the best one.
• Each video clip was viewed twice (200 tests in total).

17. Content (selected frames): grandma, eye_witness, news_cast1, fashion

18. Visual trials: selections per video clip and average bit rate
• Video clips: eye_witness, fashion, grandma, justice, lecturer, news_cast1, news_cast2, night_interview, old_man, soldier

19. Bit-rate gain across the ten video sequences

20. Conclusions - Further work
• Visual attention based ROI estimation can be used to indicate regions that need to be encoded with higher accuracy. In this way:
• a significant bit-rate gain, compared to MPEG-1, can be achieved, while
• the areas identified as visually important by the VA algorithm are in conformance with those identified by the human subjects, as can be deduced from the visual trial tests, and
• VA-ROI based encoding leads to better compression of both intra-coded and inter-coded frames, though the gain for the former is higher.
• Further work includes:
• conducting experiments to test the efficiency of the proposed method in the MPEG-4 framework;
• examining the effect of incorporating priority encoding by varying the quality factor of the DCT quantization table across VA-ROI and non-ROI frame blocks.
