
Binocular Stereo





Presentation Transcript


  1. Binocular Stereo [title slide: left image and right image shown side by side] Computer Vision

  2. Binocular Stereo • There are various methods of extracting relative depth from images. Some of the “passive” ones are based on • relative size of known objects, • occlusion cues, such as the presence of T-junctions, • motion information, • focusing and defocusing, • relative brightness. • Moreover, there are active methods, such as • sonar, which emits beams of sound waves (radar, similarly, emits radio waves), and • laser range finding, which uses a beam of light. • Stereo vision is unique because it is both passive and accurate.

  3. Human Stereo: Random Dot Stereogram Julesz’s random dot stereogram. The left image, a black-and-white image, is generated by a program that assigns a black or white value to each pixel according to a random number generator. The right image is constructed by copying the left image, but an imaginary square inside it is displaced a few pixels to the left and the empty space is filled with black and white values chosen at random. When the stereo pair is shown, observers can identify/match the imaginary square in both images and consequently “see” a square in front of the background. This shows that stereo matching can occur without recognition.

  4. Human Stereo: Illusory Contours Stereo matching occurs in the presence of illusory contours. Here not only do the illusory figures in the left and right images not match, but stereo matching also yields illusory figures not seen in either image alone. Not even the identification/matching of the illusory contour is known prior to the stereo process. These pairs give evidence that the human visual system does not process illusory contours/surfaces before processing binocular vision. Accordingly, binocular vision will hereafter be described as a process that does not require any recognition or contour detection a priori.

  5. Human Stereo: Half Occlusions [figure: left/right image pairs] An important aspect of stereo geometry is half-occlusion. There are regions of the left image that have no match in the right image, and vice versa. These unmatched regions, or half-occlusions, contain important information about the reconstruction of the scene. Even though these regions can be small, they affect the overall matching scheme, because the rest of the matching must reconstruct a scene that accounts for the half-occlusion. Leonardo da Vinci noted that the larger the discontinuity between two surfaces, the larger the half-occlusion. In 1991, Nakayama and Shimojo first showed stereo pairs where adding a single dot to one image, as above, thereby inducing occlusions, affected the overall matching of the stereo pair.

  6. Projective Camera Let Po=(Xo,Yo,Zo) be a point in the 3D world, represented in a “world” coordinate system. Let O be the center of projection of a camera, where a camera reference frame is placed. The camera coordinate system has its z axis perpendicular to the camera frame (where the image is produced), and the distance between the center and the camera frame is the focal length f. In this coordinate system the point is described by the vector Po=(Xo,Yo,Zo), and the projection of this point onto the image (the intersection of the line OPo with the camera frame) is the point po=(xo,yo,f), where xo = f Xo/Zo and yo = f Yo/Zo.
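A minimal sketch of this pinhole projection, with the point's camera-frame coordinates and the focal length f as inputs (the function name is illustrative):

```python
import numpy as np

def project(P_cam, f):
    """Perspective projection of a camera-frame point onto the image plane.

    P_cam: 3D point (X, Y, Z) in the camera coordinate system.
    f: focal length. Returns (x, y, f) with x = f*X/Z, y = f*Y/Z.
    """
    X, Y, Z = P_cam
    return np.array([f * X / Z, f * Y / Z, f])

p = project(np.array([2.0, 1.0, 4.0]), f=2.0)
# p = [1.0, 0.5, 2.0]
```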

  7. Projective Camera Coordinate System The intrinsic parameters of the camera, (sx, sy), (ox, oy), and f, represent the size of the pixels (say, in millimeters) along the x and y directions, the coordinates in pixels of the image center (also called the principal point), and the focal length of the camera. We have neglected the radial distortion of the lens, which would give an additional intrinsic parameter. The projection equation above can then be described by a linear transformation to pixel coordinates.
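The linear mapping from camera coordinates to pixel coordinates can be sketched with the usual intrinsic matrix; the numeric values and the sign convention below are illustrative assumptions, not taken from the slides:

```python
import numpy as np

# Hypothetical intrinsic values: pixel sizes s_x, s_y (mm), principal
# point (o_x, o_y) in pixels, focal length f (mm).
s_x, s_y, o_x, o_y, f = 0.01, 0.01, 320.0, 240.0, 8.0

# One common form of the intrinsic matrix mapping camera coordinates to
# homogeneous pixel coordinates (sign conventions vary between texts).
K = np.array([[-f / s_x, 0.0,      o_x],
              [0.0,      -f / s_y, o_y],
              [0.0,      0.0,      1.0]])

P_cam = np.array([0.1, 0.05, 2.0])    # point in the camera frame
p_hom = K @ P_cam                     # homogeneous pixel coordinates
x_im, y_im = p_hom[:2] / p_hom[2]     # divide by depth to get pixels
# (x_im, y_im) = (280.0, 220.0) for these values
```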

  8. Two Projective Cameras [figure: a 3D point P=(X,Y,Z) projected onto both camera frames, with centers Ol and Or, image points pl and pr, and focal length f] A 3D point P is projected onto both cameras. The change of coordinate system from left to right is described by a rotation matrix R and a translation vector T. More precisely, a point described as Pl in the left frame will be described in the right frame as Pr = R (Pl − T).
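A small sketch of this change of frame, assuming the convention Pr = R (Pl − T) with T the right camera center expressed in the left frame (other texts use Pr = R Pl + T; the exact convention is not shown in this transcript):

```python
import numpy as np

def left_to_right(P_l, R, T):
    """Express a left-camera-frame point in the right camera frame,
    assuming the convention P_r = R (P_l - T)."""
    return R @ (P_l - T)

# Example: two cameras related by a pure horizontal translation.
R = np.eye(3)                    # no rotation between the frames
T = np.array([0.1, 0.0, 0.0])    # 10 cm baseline along x
P_l = np.array([0.5, 0.2, 3.0])
P_r = left_to_right(P_l, R, T)   # -> [0.4, 0.2, 3.0]
```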

  9. Two Projective Cameras: Epipolar Lines [figure: the point P, the camera centers Ol and Or, the epipoles el and er, and the two epipolar lines] Each 3D point P defines a plane P Ol Or. This plane intersects the two camera frames, creating two corresponding epipolar lines. The line Ol Or intersects the camera planes at el and er, known as the epipoles. The line Ol Or is common to every plane P Ol Or, and thus the two epipoles belong to all pairs of epipolar lines (the epipoles are the “center/intersection” of all epipolar lines).

  10. Estimating Epipolar Lines and Epipoles The two vectors T and Pl span a two-dimensional space (the epipolar plane), and their cross product T × Pl is perpendicular to this space. Writing the resulting coplanarity condition in image coordinates gives the epipolar constraint pr^T F pl = 0, where F is known as the fundamental matrix and needs to be estimated.

  11. Computing F (fundamental matrix) • “Eight-point algorithm”: • Given two images, we identify eight or more points on both images, i.e., we provide n ≥ 8 points with their correspondences. The points have to be non-degenerate. • Then we have n linear and homogeneous equations of the form pr^T F pl = 0 • with 9 unknowns, the components of F. We need to estimate F only up to a scale factor, so there are only 8 unknowns to be computed from the n ≥ 8 linear and homogeneous equations. • If n = 8 there is a unique solution (with non-degenerate points), and if n > 8 the system is overdetermined and we can use the SVD to find the best-fit solution.
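The eight-point algorithm can be sketched as follows; this is a bare-bones version (no coordinate normalization and no rank-2 enforcement, both of which a practical implementation would add), with pl and pr as assumed (n, 2) arrays of matched image coordinates:

```python
import numpy as np

def eight_point(pl, pr):
    """Least-squares estimate of the fundamental matrix from n >= 8
    correspondences, using the homogeneous system p_r^T F p_l = 0.

    pl, pr: (n, 2) arrays of matched image coordinates.
    Returns F (3x3), defined only up to scale.
    """
    n = len(pl)
    pl_h = np.hstack([pl, np.ones((n, 1))])
    pr_h = np.hstack([pr, np.ones((n, 1))])
    # Each correspondence gives one row of the linear system A vec(F) = 0,
    # with entries pr_i * pl_j matching F's row-major layout.
    A = np.einsum('ni,nj->nij', pr_h, pl_h).reshape(n, 9)
    # The right singular vector of the smallest singular value is the
    # best-fit solution of the overdetermined homogeneous system.
    _, _, Vt = np.linalg.svd(A)
    return Vt[-1].reshape(3, 3)
```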

  12. Stereo Correspondence: Ambiguities Each potential match is represented by a square. The black ones represent the most likely scene to “explain” the images, but other combinations (e.g., the red ones) could have given rise to the same images. What makes the set of black squares preferred/unique is that they have similar disparity values, the ordering constraint is satisfied, and there is a unique match for each point. Any other set that could have given rise to the two images would have disparity values that vary more, and either the ordering constraint or the uniqueness constraint would be violated. The disparity values are inversely proportional to the depth values.
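The inverse relation between disparity and depth can be made concrete for rectified cameras, where Z = f·b/d with focal length f (in pixels) and baseline b; the numbers below are illustrative, not from the slides:

```python
# Illustrative rectified-camera parameters (assumptions, not slide data).
f_pixels = 700.0   # focal length expressed in pixels
baseline = 0.12    # distance between the camera centers, in meters

def depth_from_disparity(d):
    """Depth is inversely proportional to disparity: Z = f * b / d."""
    return f_pixels * baseline / d

# Doubling the disparity halves the estimated depth:
z1 = depth_from_disparity(10.0)   # 8.4 m
z2 = depth_from_disparity(20.0)   # 4.2 m
```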

  13. Stereo Correspondence: Matching Space [figure: left and right scanlines with points A–F, showing a depth-discontinuity boundary, a surface-orientation discontinuity, and unmatched (“no match”) regions] In the matching space, a point (or node) represents a match of a pixel in the left image with a pixel in the right image. Note 1: depth discontinuities and very tilted surfaces can/will yield the same images (with half-occluded pixels). Note 2: due to pixel discretization, points A and C in the right frame are neighbors.

  14. Cyclopean Eye To work with integer coordinate values, one can also use the representation x = l + r, w = r − l, restricted to integer values. Thus, for l, r = 0, …, N−1 we have x = 0, …, 2N−2 and w = −N+1, …, 0, …, N−1. The cyclopean eye “sees” the world in 3D, where x represents the coordinate system of this eye and w is the disparity axis. [figure: matching-space grid for one epipolar line pair, with l = 3, r = 5, x = 8, w = 2] Note: not every pair (x, w) corresponds to integer (l, r) when only integer coordinates are considered. For x + w even the pixels l and r take integer values, and for x + w odd they fall at subpixel locations.
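Consistent with the ranges and the example on the slide (l = 3, r = 5 giving x = 8, w = 2), the change of coordinates is x = l + r, w = r − l, which can be sketched as:

```python
def to_cyclopean(l, r):
    """Map left/right pixel coordinates to the cyclopean coordinate x
    and disparity w: x = l + r, w = r - l."""
    return l + r, r - l

def from_cyclopean(x, w):
    """Inverse map; returns floats, since for x + w odd the pixel
    coordinates l and r fall at subpixel (half-integer) locations."""
    return (x - w) / 2, (x + w) / 2

x, w = to_cyclopean(3, 5)    # -> (8, 2), the slide's example
l, r = from_cyclopean(8, 2)  # -> (3.0, 5.0); x + w is even
l2, r2 = from_cyclopean(7, 2)  # -> (2.5, 4.5); x + w odd, subpixel
```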

  15. Surface Constraints I: Smoothness In nature most surfaces are smooth in depth compared to their distance to the observer, but depth discontinuities also occur. Usually smoothness implies an ordering constraint: points to the right of a matched point in the left image must match points to the right of its correspondence in the right image. [figure: given that l = 3 and r = 5 are matched (blue square), the red squares represent violations of the ordering constraint, while the yellow squares represent smooth matches]
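The ordering constraint between two pixel-to-pixel matches (l, r) can be sketched as a simple sign test; the function name is illustrative:

```python
def ordering_ok(match_a, match_b):
    """Check the ordering constraint for two matches (l, r): if one match
    is further right in the left image, it must also be further right
    (or at the same place) in the right image."""
    (l1, r1), (l2, r2) = match_a, match_b
    return (l2 - l1) * (r2 - r1) >= 0

a = ordering_ok((3, 5), (4, 6))   # True: both move right together
b = ordering_ok((3, 5), (4, 4))   # False: the order swaps between images
```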

  16. Surface Constraints II: Uniqueness There should be only one disparity value associated with each cyclopean coordinate x. Note: multiple matches for left-eye points or right-eye points are allowed. [figure: given that l = 3 and r = 5 are matched (blue square), the red squares represent violations of the uniqueness constraint, while the yellow squares represent matches that are unique in the cyclopean coordinate system but multiple in the left- or right-eye coordinate systems]

  17. Bayesian Formulation The probability of a surface w(x,e) accounting for the left and right images can be described by Bayes’ formula as P(w | IL, IR) = P(IL, IR | w) P(w) / P(IL, IR), where e indexes the epipolar lines. Let us develop formulas for both probability terms in the numerator. The denominator can be computed as the normalization constant that makes the probability sum to 1.

  18. The Image Formation Peven(e,x,w) ∈ [0,1], for x + w even, represents how similar the images are between pixel (e,l) in the left image and pixel (e,r) in the right image, given that they match. The epipolar lines are indexed by e. Podd(e,x,w) ∈ [0,1], for x + w odd, represents how similar the intensity edges are between (e,l → e,l+1) in the left image and (e,r → e,r+1) in the right image.

  19. The Image Formation I (x + w even) P(e,x,w) ∈ [0,1], for x + w even, represents how similar the images are between pixel (e,l) in the left image and pixel (e,r) in the right image, given that they match. We use “left” and “right” windows to account for occlusions. We also “spread” the differences in intensity that are below a mean value as much as possible, treating differences above the mean as all “unacceptable”, i.e., P(e,x,w) = 0.

  20. The Image Formation I (x + w even, cont.) a. W ≈ 0: good, but not as good as edge-to-edge matching, so P ≈ 2/3. b. W ≈ T: the match is still good, so P ≈ 1/2. c. W ≈ 2T: the match should not be good, so P ≈ 0.16 ≈ 1/6.

  21. The Image Formation II (x + w odd)

  22. The Image Formation II (cont.) Say TD > 6. a. Large D+ ≈ TD and small D− ≈ 0: P ≈ 1. b. D+ ≈ 0 and D− ≈ 0: OK but not great (just like the W ≈ 0 match), so P ≈ 2/3. c. D+ ≈ TD and D− ≈ TD (e.g., DL ≈ 0 or DR ≈ 0): not good, so say P ≈ 1/5 ≈ (2/3)^4. d. D− > D+ is bad news, so as D− → TD and D+ ≈ 0, P → 0.

  23. Summary of Probabilities We can normalize across w, i.e., so that the probabilities sum to one over the disparity values.

  24. Problems: Flat or (Double) Tilted? [figure: two matching-space interpretations over l = 1, 2, 3 and r = 3, 4, 5 with x = 8, w = 2: (a) a flat plane and (b) a double tilted plane] The probabilities for (a) and (b) are the same; more precisely, they remain equal after normalization. The preference for flat surfaces must be built in as a prior model of surfaces.

  25. Problems: Flat or Occluded? [figure: two matching-space interpretations over l = 1, 2, 3 and r = 2, …, 5 with x = 8, w = 2: (a) a flat plane and (b) an occluded surface] The probabilities for (a) and (b) are the same; more precisely, they remain equal after normalization. The occlusions are computed as matches between edges, e.g., (x = 6, w = 1), and we should correct this. Also, and relatedly, the preference for flat surfaces must be built in as a prior model of surfaces.

  26. Problems: Occluded or Tilted? [figure: two matching-space interpretations over l = 2, 3 and r = 3, …, 6 with x = 8, w = 2: (a) flat and occluded planes and (b) tilted planes] Occlusions are computed as matches between edges, e.g., (x = 7, w = 2), and we should correct this. The balance (or preference) between flat-and-occluded and tilted surfaces must be built in as a prior model of surfaces.

  27. Prior Pairwise Model We will introduce one bias cost for tilted surfaces and another for occlusions.

  28. Prior Model for Tilted Surfaces (x + w even) [figure: matching-space grids showing a flat transition w′ = w and tilted transitions w′ = w − 1, w + 1] We introduce a cost for tilted surfaces. Tilted: w′ = w − 1 or w + 1; flat: w′ = w. How should the value of TiltCost be set? Check two neighboring match probabilities: flat should win over tilted even under some edge-to-edge matching errors (with the pixel-to-pixel matches being the same).

  29. Prior Model for Occluded Surfaces (x + w odd) [figure: matching-space grid showing an occlusion transition] A transition from edge to edge does not represent edges being matched but rather an occlusion, so a different metric must be introduced. Moreover, when an occlusion transition occurs, w → w′ = w − 1 or w + 1, we do not want the edge-to-edge match cost C[x, w+D]. Flat: w′ = w. For two neighboring matches we have

  30. Problems: Tilted or Occluded? [figure: matching space over l = 2, 3 and r = 3, …, 6 with x = 8, w = 2] Here the occluded interpretation, supported by intensity edges, wins.

  31. Problems: Tilted or Occluded? [figure: matching space over l = 2, 3 and r = 3, …, 6 with x = 8, w = 2] Here the tilted interpretation wins.

  32. Final Prior Pairwise Model

  33. Limit Disparity [figure: matching-space grid with the disparity band w = −3, …, 3 (D = 3), starting at x = D] • The search is within a disparity range of 2D + 1 values, i.e., −D ≤ w ≤ D. • The rationale is: • fewer computations; • larger disparity matches imply larger errors in the 3D estimation; • humans only fuse stereo images within a limit, called Panum’s limit. We may start the computations at x = D to avoid limiting the range of w values; in this case, we also limit the computations to at most x = 2N − 2 − D.

  34. The Posterior Model For the optimization process we are only interested in the cost function associated with the probability. We then store the cost functions associated with P(e,x,w) and T(w,w′) in arrays, as follows.

  35. Cost function arrays

  36. Dynamic Programming [figure: trellis with 2D + 1 states w = −D, …, D over stages x = 1, 2, 3, …, x−1, x, …, 2N−2] Fx*[w+D] = q(x,w) C[x, w+D] + min over i = −1, 0, 1 of { Fx−1*[w+i+D] + F[x, w+D, i+1] }

  37. Dynamic Programming • Stereo-Matching DP( ImageLeft, ImageRight, D, e ) (to be solved line by line) • Initialize • Create the graph F(V(x,w), E(x,w,x−1,w′)) (length is 2N−1 and width is 2D+1) • /* precomputation of the match cost and transition cost, stored in arrays C and F */ • loop for v = (x,w) in V (i.e., loop for x and loop for w) • Set-Array C[x, w+D] (see previous slides for the formula) • loop for i = −1, 0, 1 such that (x′ = x−1, w′ = w+i) • Set-Array F[x, w+D, i+1] (see previous slides for the formula) • end loop • end loop • Main loop • loop for x = D, D+1, …, 2N−2−D • loop for w = −D, …, 0, …, D • Cost = ∞ • loop for i = −1, 0, 1 (check −D ≤ w+i ≤ D) /* (w′ = w−1, w, w+1) */ • Temp = Fx−1*[w+i+D] + F[x, w+D, i+1] • if (Temp < Cost) • Cost = Temp • back[x, w+D] = w+i • end loop • Fx*[w+D] = Cost + C[x, w+D] if (x+w is even) • end loop • end loop
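A runnable, stripped-down version of this scanline DP might look as follows; it replaces the slides' full cost model with a squared intensity difference and a constant disparity-change penalty, and indexes states directly by the left pixel x and disparity w (so the right pixel is r = x + w) rather than by cyclopean coordinates:

```python
import numpy as np

def scanline_dp(left, right, D, tilt_cost=0.1):
    """Simplified dynamic-programming match of one epipolar line pair.

    left, right: 1-D intensity arrays of length N. D: disparity limit.
    Returns an integer disparity per left pixel. This is a sketch of the
    slides' algorithm with a squared-difference match cost and a constant
    penalty for disparity changes (the w' = w-1, w, w+1 transitions).
    """
    N = len(left)
    W = 2 * D + 1
    F = np.full((N, W), np.inf)        # F[x, w+D]: best cost ending here
    back = np.zeros((N, W), dtype=int)
    for w in range(-D, D + 1):         # initialize the first pixel
        if 0 <= w < N:
            F[0, w + D] = (left[0] - right[w]) ** 2
    for x in range(1, N):
        for w in range(-D, D + 1):
            if not (0 <= x + w < N):   # right pixel out of the image
                continue
            c = (left[x] - right[x + w]) ** 2
            for i in (-1, 0, 1):       # previous state w' = w + i
                if 0 <= w + i + D < W and np.isfinite(F[x - 1, w + i + D]):
                    t = F[x - 1, w + i + D] + (tilt_cost if i else 0.0)
                    if t + c < F[x, w + D]:
                        F[x, w + D] = t + c
                        back[x, w + D] = w + i
    # Backtrack from the best final state.
    disp = np.zeros(N, dtype=int)
    disp[N - 1] = np.argmin(F[N - 1]) - D
    for x in range(N - 2, -1, -1):
        disp[x] = back[x + 1, disp[x + 1] + D]
    return disp
```

Matching a scanline against itself recovers zero disparity everywhere, and matching against a shifted copy recovers the shift away from the borders.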

  38. Epipolar Line Interaction [figure: left and right images] Epipolar-line interaction: the larger the intensity edges, the lower the cost (the higher the probability) of having disparity changes across epipolar lines.

  39. Stereo Correspondence: Belief Propagation (BP) We now have the posterior distribution for the disparity values (the surface {w(x,e)}). We want to compute the marginals at each site. These are exponential computations in the size (length) of the grid, N.

  40. Kai Ju’s Approximation to BP We use Kai Ju’s Ph.D. thesis work to approximate the (x,e) graph/lattice by horizontal and vertical graphs, which are singly connected; thus, exact computation of the marginals in these graphs can be obtained in linear time. We combine the probabilities obtained from the horizontal and vertical graphs, for each lattice site, by “picking” the “best” one (the one with lower entropy). [figures: “horizontal” belief tree and “vertical” belief tree]
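The "pick the lower-entropy marginal" step can be sketched as follows, using the usual entropy H = −Σ p log p (the lower-entropy distribution is the more "decided" one):

```python
import numpy as np

def pick_lower_entropy(p_horizontal, p_vertical):
    """Combine the horizontal- and vertical-tree marginals at one lattice
    site by keeping the distribution with lower entropy."""
    def entropy(p):
        p = np.asarray(p, dtype=float)
        nz = p[p > 0]                      # skip zero entries: 0 log 0 = 0
        return -np.sum(nz * np.log(nz))
    if entropy(p_horizontal) <= entropy(p_vertical):
        return p_horizontal
    return p_vertical

# A peaked marginal beats a flat one:
peaked = [0.9, 0.05, 0.05]
flat = [0.34, 0.33, 0.33]
best = pick_lower_entropy(peaked, flat)   # -> the peaked distribution
```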

  41. Result

  42. Some Issues in Stereo [figure: regions A and B marked in the left and right images] Junctions and their properties: false matches can reveal information from vertical disparities (see Malik 94, ECCV).
