PAMI Lab, U. of Waterloo Manglai Zhou Apr. 8, 2010 3D Model-Based Hand Gesture Recognition and Tracking
Topic 1. Introduction 2. Human Hand Modeling 3. Feature Selection and Extraction 4. Model-Based Hand Posture Recognition 5. Hand Motion Tracking 6. Conclusion Refs.
1. Introduction Hand gestures: Purpose of human gestures: conversational, controlling, manipulative, and communicative. More natural and intuitive in CV, esp. in 3-D apps. As an assistive/supporting means for analyzing human intent and identifying potential threats in a multi-modality surveillance system (Project MUSES_SECRET).
1. Introduction Vision-based hand gesture recognition Challenges: Highly articulated, with many joints and high DOFs Highly constrained: static and dynamic constraints, hard to model Two representations: Appearance-based and 3-D model-based Two steps: Static posture recognition Gesture understanding (semantics)
1. Introduction My work mainly concentrates on 3D model-based hand gesture recognition. Makes use of the kinematic structure of the hand, i.e. the pose of the palm, the finger joint angles, etc. PRO: View-independent, more appropriate for multi-camera vision systems. Provides more detailed info for interpretation of hand gestures. CON: Sophisticated modeling. Requires more intensive processing power.
2. Human Hand Modeling Representations of a hand and 3-D model. Human hand motion has 26 DOF. Global configuration: six DOF, representing the pose of the hand (position and orientation). Local configuration: 20 angular DOF of the fingers. The DIP and PIP joints each have one degree of freedom for rotation; the MCP joint has two degrees of freedom. Finger motion constraints are applied to define the range within which each finger may move.
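To make the 26-DOF count concrete, the following is a minimal sketch of one possible parameter layout. The class and field names are hypothetical, and treating the thumb with the same four-angle layout as the other fingers is a simplifying assumption, not something stated on the slide.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FingerState:
    """Joint angles (radians) for one finger: 4 DOF in total."""
    mcp_flexion: float = 0.0    # MCP joint, first DOF
    mcp_abduction: float = 0.0  # MCP joint, second DOF
    pip_flexion: float = 0.0    # PIP joint, one DOF
    dip_flexion: float = 0.0    # DIP joint, one DOF

    def as_vector(self) -> List[float]:
        return [self.mcp_flexion, self.mcp_abduction,
                self.pip_flexion, self.dip_flexion]


@dataclass
class HandState:
    """Global configuration (6 DOF) plus local configuration (5 x 4 = 20 DOF)."""
    position: List[float] = field(default_factory=lambda: [0.0, 0.0, 0.0])
    orientation: List[float] = field(default_factory=lambda: [0.0, 0.0, 0.0])
    fingers: List[FingerState] = field(
        default_factory=lambda: [FingerState() for _ in range(5)])

    def as_vector(self) -> List[float]:
        # Concatenate global pose and all finger angles into one state vector.
        vec = list(self.position) + list(self.orientation)
        for finger in self.fingers:
            vec.extend(finger.as_vector())
        return vec


hand = HandState()
print(len(hand.as_vector()))  # 26
```

A flat state vector like this is convenient for the search and tracking steps, which operate on the whole configuration at once.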
2. Human Hand Modeling The kinematic model is augmented with shape information to generate appearances of a hand as seen in 2-D images. A 3-D model has been built in the OpenGL graphics programming environment. The palm is represented by a flat, chamfered rectangular box. Each finger segment is approximated by a sphere-ended cylinder with its own dimensions. Each joint is modeled using a rotation matrix, with a pre-defined range (constraint).
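As an illustration of a joint modeled by a rotation matrix with a pre-defined range, here is a minimal sketch. The function names, the choice of the x axis as the flexion axis, and the 0 to 90 degree range are all assumptions made for the example.

```python
import math


def clamp(angle, lo, hi):
    """Restrict a joint angle to its allowed range."""
    return max(lo, min(hi, angle))


def rotation_x(angle):
    """3x3 rotation matrix about the x axis (assumed flexion axis)."""
    c, s = math.cos(angle), math.sin(angle)
    return [[1.0, 0.0, 0.0],
            [0.0, c, -s],
            [0.0, s, c]]


def joint_rotation(angle, lo, hi):
    """Rotation matrix for a joint whose angle is constrained to [lo, hi]."""
    return rotation_x(clamp(angle, lo, hi))


def apply(R, v):
    """Multiply a 3x3 matrix by a 3-vector."""
    return [sum(R[i][j] * v[j] for j in range(3)) for i in range(3)]


# Hypothetical range for a PIP joint: 0 to 90 degrees.
PIP_RANGE = (0.0, math.radians(90.0))

# A requested 120 degree flexion is clamped to 90 degrees.
R = joint_rotation(math.radians(120.0), *PIP_RANGE)
tip = apply(R, [0.0, 1.0, 0.0])  # segment initially along +y
print(tip)
```

Clamping is the simplest way to enforce a static constraint; dynamic constraints that couple several joints would need additional rules.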
2. Human Hand Modeling • 3-D hand model: • pose1: open palm, pose2: fist
2. Human Hand Modeling • 3-D hand model: • pose3: pointing, pose4: victory
2. Human Hand Modeling All 20 local DOFs are modeled with static and dynamic constraints. Different fingers are color-coded just for easy identification; actual models will use skin color. 2-D projections of any posture at any angle can be easily obtained by manipulating the model in 3-D space and performing a perspective projection. For the global configuration, only one DOF is implemented: rotation about the vertical axis.
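The projection step can be sketched as follows: rotate a model point about the vertical axis, then apply a perspective projection. The y axis as the vertical axis, the camera placement, and the parameter names are assumptions for illustration; the OpenGL model performs the equivalent operations through its own matrices.

```python
import math


def rotate_about_vertical(p, angle):
    """Rotate a 3-D point about the y (vertical) axis by `angle` radians."""
    x, y, z = p
    c, s = math.cos(angle), math.sin(angle)
    return (c * x + s * z, y, -s * x + c * z)


def project(p, focal_length=1.0, camera_distance=5.0):
    """Perspective projection onto the image plane.

    The camera is assumed to look along +z from z = -camera_distance.
    """
    x, y, z = p
    depth = z + camera_distance
    if depth <= 0:
        raise ValueError("point is behind the camera")
    return (focal_length * x / depth, focal_length * y / depth)


# Project a fingertip-like point for a few rotation angles of the hand.
point = (1.0, 2.0, 0.0)
for degrees in (0, 45, 90):
    rotated = rotate_about_vertical(point, math.radians(degrees))
    print(degrees, project(rotated))
```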
3. Feature Selection and Extraction The selection of image features and the method of extraction have a significant impact on overall system performance.
3. Feature Selection and Extraction High-level features: fingertips, fingers, joint locations, etc. Intuitive representation and efficient processing, but hard to extract. Low-level features: colors, contours, edges, silhouette, etc. Skin color segmentation. Distance metric: Chamfer matching. Easier to obtain; sensitive to finger/palm angles.
3. Feature Selection and Extraction • Hand feature: silhouette images • pose1: open palm, pose2: fist
3. Feature Selection and Extraction • Hand feature: silhouette images • pose3: pointing, pose4: victory
3. Feature Selection and Extraction Skin color segmentation. Canny edge detector (implemented). Hand shape normalization (dimension). 3-D features: stereo cameras obtain 3-D images. Depth information helps with cluttered backgrounds. The acquired surface is matched to the model surface.
4. Model-Based Hand Posture Recognition A hand appears very different at different orientations or viewpoints. Database approach: efficient searching and accurate indexing of the image database. Template matching: Chamfer distance, where ||x – y|| denotes the Euclidean distance between two pixel locations x and y.
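The Chamfer distance formula itself is not reproduced in the text of this slide. One common form of the directed Chamfer distance is given below as an assumption: X (edge pixels of the model template) and Y (edge pixels of the input image) are set names introduced here for illustration, and the talk may have used a different variant, such as a symmetric or max-based one.

```latex
c(X, Y) = \frac{1}{|X|} \sum_{x \in X} \min_{y \in Y} \lVert x - y \rVert
```

In words: for every template edge pixel, take the distance to the closest image edge pixel, then average over the template.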
4. Model-Based Hand Posture Recognition Distance transform (DT): approximation of the Euclidean distance in 2-D/3-D. Distance mask (3×3): a = 3, b = 4. DT generates a new image in which each pixel value gives the distance to the nearest edge. Efficient algorithms exist to compute it. Calculated only once for each frame.
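A minimal sketch of a two-pass chamfer distance transform is shown below. It assumes the values a = 3 and b = 4 are the weights of the usual 3-4 chamfer mask (a for horizontal/vertical steps, b for diagonal steps); the function name and the list-of-lists image format are hypothetical.

```python
def chamfer_distance_transform(edges, a=3, b=4):
    """Two-pass 3-4 chamfer distance transform.

    edges: list of rows, 1 for an edge pixel and 0 otherwise.
    Returns distances in mask units (divide by `a` for approximate pixels).
    """
    h, w = len(edges), len(edges[0])
    INF = 10 ** 9
    dt = [[0 if edges[r][c] else INF for c in range(w)] for r in range(h)]

    # Forward pass: propagate from the top-left neighbours.
    for r in range(h):
        for c in range(w):
            d = dt[r][c]
            if c > 0:
                d = min(d, dt[r][c - 1] + a)
            if r > 0:
                d = min(d, dt[r - 1][c] + a)
                if c > 0:
                    d = min(d, dt[r - 1][c - 1] + b)
                if c < w - 1:
                    d = min(d, dt[r - 1][c + 1] + b)
            dt[r][c] = d

    # Backward pass: propagate from the bottom-right neighbours.
    for r in range(h - 1, -1, -1):
        for c in range(w - 1, -1, -1):
            d = dt[r][c]
            if c < w - 1:
                d = min(d, dt[r][c + 1] + a)
            if r < h - 1:
                d = min(d, dt[r + 1][c] + a)
                if c < w - 1:
                    d = min(d, dt[r + 1][c + 1] + b)
                if c > 0:
                    d = min(d, dt[r + 1][c - 1] + b)
            dt[r][c] = d
    return dt


edges = [[0, 0, 0],
         [0, 1, 0],
         [0, 0, 0]]
print(chamfer_distance_transform(edges))  # [[4, 3, 4], [3, 0, 3], [4, 3, 4]]
```

Because the transform depends only on the input frame, it can be computed once and reused for every template compared against that frame.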
4. Model-Based Hand Posture Recognition The edge model of the target image is superimposed onto the distance image. The average or maximum of the distance values that the edge model hits gives the Chamfer distance.
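The superimposition step can be sketched as a lookup into a precomputed distance image. The function name, the (row, column) point format, and the `offset` and `mode` parameters are hypothetical; the example distance image is a 3×3 array of 3-4 chamfer values around a single central edge pixel.

```python
def chamfer_score(dt, template_points, offset=(0, 0), mode="avg"):
    """Score a template against a distance image.

    dt: distance image (list of rows).
    template_points: (row, col) edge pixels of the template.
    offset: (row, col) shift applied to the template before lookup.
    mode: "avg" for the mean distance, "max" for the maximum distance.
    """
    dr, dc = offset
    h, w = len(dt), len(dt[0])
    hits = []
    for r, c in template_points:
        rr, cc = r + dr, c + dc
        if 0 <= rr < h and 0 <= cc < w:  # ignore points falling off the image
            hits.append(dt[rr][cc])
    if not hits:
        raise ValueError("template does not overlap the distance image")
    return max(hits) if mode == "max" else sum(hits) / len(hits)


dt = [[4, 3, 4],
      [3, 0, 3],
      [4, 3, 4]]
template = [(1, 1), (1, 2)]
print(chamfer_score(dt, template))              # 1.5
print(chamfer_score(dt, template, mode="max"))  # 3
```

A lower score means a better match, so the template with the smallest score over all candidate poses would be selected.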
4. Model-Based Hand Posture Recognition • An example of DT image (for the V pose)
4. Model-Based Hand Posture Recognition Single frame pose estimation: The estimation from one image or multiple images of different views. Hand orientation determined first. Search over all possible configurations, given the hand orientation and motion constraints.
4. Model-Based Hand Posture Recognition Hand pose classification: the classifier is trained on a large number of labeled poses, which can be generated from artificial 3-D hand models. Image database indexing: indexing speeds up searching large databases of templates, so that the nearest neighbor(s) of a given input can be found quickly.
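For reference, the baseline that indexing aims to speed up is a linear scan over all templates. The sketch below is that brute-force baseline, not an indexing method; the feature vectors, labels, and the use of Euclidean distance between feature vectors are placeholders chosen for the example.

```python
import heapq
import math


def nearest_neighbors(query, database, k=1):
    """Return the k database entries closest to `query`.

    database: list of (label, feature_vector) pairs.
    Distance here is Euclidean; a Chamfer distance between edge images
    could be substituted without changing the structure.
    """
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    return heapq.nsmallest(k, database, key=lambda item: dist(query, item[1]))


# Hypothetical labeled templates, e.g. rendered from a 3-D hand model.
database = [
    ("open palm", [1.0, 1.0, 1.0]),
    ("fist", [0.0, 0.0, 0.0]),
    ("pointing", [1.0, 0.0, 0.0]),
    ("victory", [1.0, 1.0, 0.0]),
]
print(nearest_neighbors([0.9, 0.1, 0.0], database, k=2))
```

The cost of this scan grows linearly with the number of templates, which is why large databases need an index.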
5. Hand Motion Tracking Hand gesture: a sequence of hand/finger motions that bears a certain meaning. Two types of human hand tracking: 1. Single hypothesis tracking 2. Multiple hypotheses tracking (MHT). The configuration space can be represented as a tree. Tree structures improve processing by employing fast hierarchical searches.
5. Hand Motion Tracking [Figure: block diagram of model-based tracking. Block labels: Frame 0; Pose Estimation; Initialization; Predicted Pose; Frame k; Calculation of Model Features; Prediction; Feature Extraction; Model Features; Observed Features; Error Calculation; Search for Match; Best State; Updated State.]
5. Hand Motion Tracking Bayesian tracking Multi-resolution partitioning of the state space. Particle filtering Approximate arbitrary distributions with a set of random samples. Deal with clutter and ambiguous situations more effectively, by multiple hypotheses. Tree-based filtering and searching Cluster prototype: a group of similar shape templates.
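To illustrate how particle filtering approximates a distribution with random samples, here is a minimal bootstrap particle filter for a single scalar state, for example one joint angle. The random-walk motion model, the Gaussian measurement likelihood, and all parameter values are assumptions chosen for the sketch, not the models used in the talk.

```python
import math
import random


def particle_filter_step(particles, measurement, rng,
                         motion_std=0.1, meas_std=0.2):
    """One predict / weight / resample cycle for a scalar state."""
    # 1. Predict: diffuse each hypothesis with a random-walk motion model.
    predicted = [p + rng.gauss(0.0, motion_std) for p in particles]

    # 2. Weight: Gaussian likelihood of the measurement under each hypothesis.
    weights = [math.exp(-0.5 * ((measurement - p) / meas_std) ** 2)
               for p in predicted]
    if sum(weights) == 0.0:
        # All likelihoods underflowed; fall back to uniform weights.
        weights = [1.0] * len(predicted)

    # 3. Resample: draw a new particle set in proportion to the weights.
    return rng.choices(predicted, weights=weights, k=len(predicted))


rng = random.Random(0)
particles = [rng.uniform(0.0, 1.5) for _ in range(500)]  # initial hypotheses
for z in [0.8, 0.8, 0.8, 0.8, 0.8]:                      # simulated measurements
    particles = particle_filter_step(particles, z, rng)

estimate = sum(particles) / len(particles)
print(round(estimate, 2))
```

Because many hypotheses survive each step, an ambiguous or cluttered measurement does not force an early commitment to a single pose.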
5. Hand Motion Tracking • Tracking is posed as a Bayesian inference problem. • State: the internal parameters of an object at time t. • Measurement: the observation obtained at time t. • Goal: state estimation.
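The symbols on this slide are not legible in the text. Under the assumed notation x_t for the state at time t and z_t for the measurement at time t (both names introduced here), the standard recursive Bayesian filter consists of a prediction step followed by an update step:

```latex
p(x_t \mid z_{1:t-1}) = \int p(x_t \mid x_{t-1}) \, p(x_{t-1} \mid z_{1:t-1}) \, dx_{t-1}

p(x_t \mid z_{1:t}) \propto p(z_t \mid x_t) \, p(x_t \mid z_{1:t-1})
```

State estimation then amounts to reading off a summary of the posterior, such as its mode or mean.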
5. Hand Motion Tracking Hierarchical partitioning of the state space
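A coarse-to-fine search over a partitioned state space can be sketched in one dimension as follows. This greedy version keeps only the best cell at each level, which is a simplification: a tree-based Bayesian filter would typically carry several promising cells forward. The function name, the cost function, and the parameters are hypothetical.

```python
def hierarchical_search(cost, lo, hi, branches=4, levels=5):
    """Greedy coarse-to-fine minimization of `cost` over [lo, hi].

    At each level the current interval is split into `branches` cells,
    the cost is evaluated at each cell centre, and the search descends
    into the cell with the lowest cost.
    """
    for _ in range(levels):
        width = (hi - lo) / branches
        centres = [lo + (i + 0.5) * width for i in range(branches)]
        best = min(centres, key=cost)
        lo, hi = best - width / 2, best + width / 2
    return (lo + hi) / 2


# Hypothetical matching cost with its minimum at a joint angle of 0.3.
estimate = hierarchical_search(lambda x: (x - 0.3) ** 2, 0.0, 1.0)
print(round(estimate, 3))  # 0.3
```

With 4 branches and 5 levels this evaluates 20 candidates instead of the 1024 needed for an exhaustive scan at the finest resolution.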
5. Hand Motion Tracking Challenges: How to adapt the hand model to a specific target? How to establish correspondences and combine (fuse) image data from multiple cameras in a 3-D framework? How well does an algorithm handle occlusions and perform in highly cluttered environments? How to interpret the semantic meaning of a hand gesture?
6. Conclusion 1. Hand gesture recognition is challenging due to the hand's complex articulation and constraints, high DOF, and heavy self-occlusion. 2. 3-D model-based recognition is suitable for multi-camera vision-based systems. 3. The global configuration of the hand should be determined first to reduce the search space. 4. Particle filtering and tree-based searching help improve tracking robustness and overcome the computational hurdles.
References:
• Ying Wu and Thomas S. Huang, Hand Modeling, Analysis and Recognition for Vision-Based Human Computer Interaction. IEEE Signal Processing Mag., May 2001, p. 51-60
• A. Erol, et al., Vision-based hand pose estimation: A review. Computer Vision and Image Understanding 108 (2007) 52–73
• M. Potamias and V. Athitsos, Nearest Neighbor Search Methods for Handshape Recognition. PETRA'08, July 15-19, 2008, Athens, Greece
• D. P. Huttenlocher, et al., Comparing Images Using the Hausdorff Distance. IEEE Trans. PAMI 15 (9) (Sept 1993) 850–863
• H.G. Barrow, et al., Parametric Correspondence and Chamfer Matching: Two New Techniques for Image Matching, NASA Technical Report, Vision-7, p. 659-670.
PAMI Lab, U. of Waterloo Manglai Zhou Apr. 8, 2010 Paper Survey: A Prototype for 3-D Hand Tracking and Posture Estimation
Overview Present a prototype for 3-D hand tracking and dynamic gesture recognition. Objective: track the hand against a general background and recognize dynamic gestures in real time. Three phases: simulation, real-world video stream test, and multiple-camera data fusion. Suggest a road map for future development to reach the final goal.
Introduction: Camera-based posture-estimation system. Data glove is used to calibrate and validate the system. (CyberGlove) Color Markers are employed to identify the gesturing hand and the fingertips
The Proposed Approach Three phases: 1. Graphical simulation of the hand tracking problem 2. Tracking with a real video camera and validating the accuracy of the tracking system using the CyberGlove as a reference 3. Extend to multi-cameras
Phase 1: Simulation Study the feasibility of single-camera vision-based hand tracking. 26-DOF 3-D hand model. CyberGlove. Square marker: palm position and orientation (global configuration). Fingertips: finger posture and joint angles (local configurations).
Phase 1: Simulation (Cont.) 2-D projections are used to estimate the 3-D hand posture, based on geometric computations and inverse kinematics. 3-D/2-D Feature-to-Posture Transformation: how 3-D model data are projected onto the image plane. Forward kinematics: 4×4 matrix transformation.
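Forward kinematics with 4×4 matrices can be sketched for a single planar finger as below. Each joint contributes a homogeneous transform (a rotation followed by a translation along the link), and the product of the transforms gives the fingertip position. The planar simplification, the link lengths, and the function names are assumptions for illustration.

```python
import math


def matmul(A, B):
    """Multiply two 4x4 matrices."""
    return [[sum(A[i][k] * B[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def joint_transform(angle, length):
    """Rotate about z by `angle`, then translate `length` along the rotated x axis."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s, 0.0, length * c],
            [s, c, 0.0, length * s],
            [0.0, 0.0, 1.0, 0.0],
            [0.0, 0.0, 0.0, 1.0]]


def fingertip_position(angles, lengths):
    """Chain the joint transforms from the finger base to the fingertip."""
    T = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
    for angle, length in zip(angles, lengths):
        T = matmul(T, joint_transform(angle, length))
    return (T[0][3], T[1][3], T[2][3])


# Hypothetical link lengths (proximal, middle, distal) in centimetres.
lengths = [4.0, 3.0, 2.0]
print(fingertip_position([0.0, 0.0, 0.0], lengths))  # (9.0, 0.0, 0.0)
print(fingertip_position([math.radians(30)] * 3, lengths))
```

Projecting the resulting 3-D fingertip through the camera model gives the 2-D feature that is compared with the detected marker.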
Phase 1: Simulation (Cont.) 2-D/3-D Feature-to-Posture Transformation 2-D marker features => hand posture hypothesis Pinhole camera model utilized Perspective geometry and its relevant constraints Finger posture: use detected finger markers to determine a reachable range by the finger along the camera view direction The reachable linear segment is then sampled at constant lengths to calculate a finger posture hypothesis by IK.
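The sampling of candidate fingertip positions along the camera view direction can be sketched with a pinhole model as follows. The function names, the principal point at the image origin, and the near/far limits of the reachable segment are assumptions; each sampled 3-D point would then be passed to an inverse kinematics solver, which is not shown.

```python
import math


def viewing_ray(u, v, focal_length, cx=0.0, cy=0.0):
    """Unit direction of the ray through pixel (u, v) in a pinhole camera."""
    d = ((u - cx) / focal_length, (v - cy) / focal_length, 1.0)
    norm = math.sqrt(sum(c * c for c in d))
    return tuple(c / norm for c in d)


def sample_reachable_segment(u, v, focal_length, t_near, t_far, step):
    """Sample 3-D points at constant spacing `step` along the viewing ray.

    t_near and t_far bound the part of the ray the fingertip can reach,
    measured as distance from the camera centre.
    """
    direction = viewing_ray(u, v, focal_length)
    count = int(math.floor((t_far - t_near) / step + 1e-9)) + 1
    return [tuple(c * (t_near + i * step) for c in direction)
            for i in range(count)]


# Marker detected at the image centre; hypothetical reachable range 10 to 12.
for candidate in sample_reachable_segment(0.0, 0.0, 500.0, 10.0, 12.0, 1.0):
    print(candidate)
```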
Phase 1: Simulation (Cont.) Thumb: binary search of a lookup table of all feasible end-effector positions Other fingers: solved by error model analysis technique
Prototype Phase 2 – Facing the Reality Many practical parameters differ from the simulation. Detection of 2-D features from acquired video frames, utilizing segmented color and silhouette. Palm: two colored markers (one each on the front and back). Fingertips: five colored ring markers (one for each finger).
Prototype Phase 3 – Multiple cameras Camera sensor fusion. Type 1: a posture hypothesis is generated separately for each camera and then validated using the observation models; useful when cameras are mobile. Type 2: the geometrical transformation between camera coordinate frames is used; the best orientation is used by both models.
Conclusions The framework presented includes two steps: posture hypothesis and validation. The framework provides reasonable results compared with the CyberGlove. Multiple cameras help cover more area and improve tracking accuracy. Handles intermittent occlusion for a short time. Future work: 3-D marker-less hand tracking.
Comments: A prototype of 3-D model-based hand tracking in a general environment with unconstrained background. Recognizes dynamic gestures in real time. A data glove is used to validate the proposed framework. Colored markers are used to assist palm and fingertip recognition.
Comments (Cont.): Lack of palm identification for bare hands. Hand silhouette and skin color could be used for hand orientation estimation. Marker-less edge/contour detection for fingertips. Elbow, arm and shoulder info may be used to reduce the dimensionality of 3-D hand model matching.
3D Model-Based Hand Gesture Recognition and Tracking Questions...... Comments...... Suggestions......