Real-Time Human Pose Recognition in Parts from Single Depth Images

Jamie Shotton, Andrew Fitzgibbon, Mat Cook, Toby Sharp, Mark Finocchio, Richard Moore, Alex Kipman, Andrew Blake • Microsoft Research Cambridge & Xbox Incubation • CVPR 2011 Best Paper

OUTLINE • Introduction • Data • Body Part Inference and Joint Proposals • Experiments • Discussion

Introduction • Robust interactive human body tracking has applications in gaming, human-computer interaction, security, telepresence, and health care • Real-time depth cameras make the task easier • Existing systems track from frame to frame but struggle to re-initialize quickly, so they are not robust • Our focus: per-frame initialization, to complement a tracking algorithm • Pose recognition in parts • Output: 3D position candidates for each skeletal joint

Introduction • An appropriate tracking algorithm remains complementary, for example: • Tracking people with twists and exponential maps (CVPR 1998) • Tracking loose limbed people (CVPR 2004) • Nonlinear body pose estimation from depth images (DAGM 2005) • Real-time hand-tracking with a color glove (ACM 2009) • Real time motion capture using a single time-of-flight camera (CVPR 2010)

Introduction • Inspired by recent object recognition work that divides objects into parts • Object class recognition by unsupervised scale-invariant learning (CVPR 2003) • The layout consistent random field for recognizing and segmenting partially occluded objects (CVPR 2006) • Two key design goals • Computational efficiency • Robustness

Introduction • Pipeline: depth image → (segment) dense probabilistic body part labeling, with parts spatially localized near skeletal joints → (generate) 3D joint proposals

Introduction • We treat the segmentation into body parts as a per-pixel classification task • Each pixel is evaluated separately • Training data • Generate realistic synthetic depth images • Train a deep randomized decision forest classifier; the large, varied training set helps avoid overfitting

Introduction • Overfitting is countered by the size and variety of the training set • Simple, discriminative depth comparison image features • These maintain high computational efficiency

Introduction • For further speed, the classifier can be run in parallel on each pixel on a GPU • Mean shift finds the spatial modes of the inferred per-pixel distributions, resulting in the 3D joint proposals

What is Mean Shift? • A tool for finding modes in a set of data samples, manifesting an underlying probability density function (PDF) in R^N • PDF in feature space • Color space • Scale space • Actually any feature space you can conceive • Data → non-parametric density estimation → discrete PDF representation • Data → non-parametric density gradient estimation (mean shift) → PDF analysis

Intuitive Description • Objective: find the densest region • Distribution of identical billiard balls • Region of interest • Center of mass • Mean shift vector
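
The intuitive description can be sketched in a few lines of Python. This is a minimal illustration with a flat (uniform) window, not the weighted Gaussian kernel used later for joint proposals; the function name `mean_shift_mode`, the window radius, and the toy data are all assumptions made for the example.

```python
import numpy as np

def mean_shift_mode(points, start, radius, max_iter=100, tol=1e-6):
    """Move a window of the given radius to its local center of mass until it stops."""
    center = np.asarray(start, dtype=float)
    for _ in range(max_iter):
        # Region of interest: all samples within `radius` of the current center.
        inside = points[np.linalg.norm(points - center, axis=1) <= radius]
        if len(inside) == 0:
            break  # empty window: nothing to move toward
        new_center = inside.mean(axis=0)   # center of mass of the window
        shift = new_center - center        # mean shift vector
        center = new_center
        if np.linalg.norm(shift) < tol:
            break
    return center

# Toy "billiard balls": a dense cluster near the origin and a sparse one near (3, 3).
rng = np.random.default_rng(0)
points = np.vstack([
    rng.normal([0.0, 0.0], 0.3, size=(200, 2)),
    rng.normal([3.0, 3.0], 0.3, size=(20, 2)),
])
mode = mean_shift_mode(points, start=[1.0, 1.0], radius=1.5)
```

Starting between the clusters, the window drifts toward the denser one and settles near its center.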

Main contribution • Treat pose estimation as object recognition • using a novel intermediate body parts representation • spatially localize joints • low computational cost and high accuracy

Experiments • (i) Synthetic depth training data is an excellent proxy for real data • (ii) Scaling up the learning problem with varied synthetic data is important for high accuracy • (iii) Our parts-based approach generalizes better than even an oracular exact nearest neighbor

Data • Depth imaging and motion capture data • Pose estimation research has often focused on techniques to overcome a lack of training data • Two problems in synthesizing training images • color and texture variability • pose coverage

Depth image • Use real mocap data • Retargeted to a variety of base character models • to synthesize a large, varied dataset • 640x480 image at 30 frames per second • Advantages of depth cameras over traditional intensity sensors • working in low light levels • giving a calibrated scale estimate • resolving silhouette ambiguities in pose

Motion capture data • Capture a large database of motion capture (mocap) of human actions • approximately 500k frames • (driving, dancing, kicking, running, navigating menus) • Need not record mocap with variation in rotation about the vertical axis, mirroring left-right, scene position, body shape and size, or camera pose • all of which can be added in (semi-)automatically

Motion capture data • The classifier uses no temporal information • It sees static poses, not motion • Changes in pose from one frame to the next are often so small as to be insignificant • Redundant poses are discarded using a ‘furthest neighbor’ clustering algorithm • The distance between poses p1 and p2 is max_j ||p1^j - p2^j||, the maximum Euclidean distance over body joints j • Keep a subset in which no two poses are closer than 5 cm
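
One way to read the pose-subsampling step is as greedy farthest-point selection. The sketch below assumes poses are stored as J x 3 arrays of joint positions in metres; the function names and the choice of the first pose as the seed are assumptions for illustration, not details given on the slide.

```python
import numpy as np

def pose_distances(poses, reference):
    """Max-over-joints Euclidean distance from every pose to one reference pose."""
    # poses: (N, J, 3), reference: (J, 3) -> per-joint distances (N, J) -> max over joints (N,)
    return np.linalg.norm(poses - reference, axis=2).max(axis=1)

def furthest_neighbor_subset(poses, min_dist=0.05):
    """Greedily add the pose farthest from the kept set until all are within min_dist."""
    poses = np.asarray(poses, dtype=float)
    kept = [0]                                   # arbitrary seed pose (assumption)
    nearest = pose_distances(poses, poses[0])    # distance to the nearest kept pose
    while True:
        candidate = int(nearest.argmax())
        if nearest[candidate] <= min_dist:
            break                                # every pose is close to a kept one
        kept.append(candidate)
        nearest = np.minimum(nearest, pose_distances(poses, poses[candidate]))
    return kept

# Three toy poses with two joints each.
base = np.zeros((2, 3))
near = base + np.array([0.01, 0.0, 0.0])          # every joint moved by 1 cm
far = base.copy()
far[0] += np.array([0.20, 0.0, 0.0])              # one joint moved by 20 cm
kept = furthest_neighbor_subset([base, near, far])
```

Here the 1 cm variant is dropped as redundant, while the pose with one joint moved by 20 cm is kept.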

Motion capture data • necessary to iterate the process of motion capture • sampling from our model • training the classifier • testing joint prediction accuracy • CMU mocap database

Generating synthetic data • build a randomized rendering pipeline • sample fully labeled training images • Goals • realism and variety

Generating synthetic data • First: randomly sample a set of parameters • Then use standard computer graphics techniques • render depth and body part images • from texture mapped 3D meshes • Use Autodesk MotionBuilder • slight random variation in height and weight gives extra coverage of body shapes • Other parameters

Generating synthetic data

Body Part Inference and Joint Proposals • Body part labeling • Depth image features • Randomized decision forests • Joint position proposals

Body part labeling • Intermediate body part representation • parts are color-coded • Some parts directly localize particular skeletal joints • others fill the gaps • This transforms the problem into one that can readily be solved by efficient classification algorithms

Body part labeling • The parts are specified in a texture map

Body part labeling • 31 body parts: LU/RU/LW/RW head, neck, L/R shoulder, LU/RU/LW/RW arm, L/R elbow, L/R wrist, L/R hand, LU/RU/LW/RW torso, LU/RU/LW/RW leg, L/R knee, L/R ankle, L/R foot (Left, Right, Upper, loWer)

Depth image features • f_θ(I, x) = d_I(x + u / d_I(x)) - d_I(x + v / d_I(x)) • d_I(x) is the depth at pixel x in image I • θ = (u, v) describes offsets u and v • Normalizing the offsets by 1 / d_I(x) ensures the features are depth invariant
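
A minimal sketch of such a depth comparison feature follows. It assumes pixels are indexed as (row, column), that offsets are rounded to the nearest pixel, and that probes landing outside the image return a large constant depth; `BACKGROUND_DEPTH` and the function names are assumptions made for this example.

```python
import numpy as np

BACKGROUND_DEPTH = 1e6  # assumed large constant for probes that fall off the image

def probe(depth, x, offset):
    """Depth at pixel x shifted by an offset that is scaled by 1 / depth(x)."""
    d = depth[x[0], x[1]]
    row = int(round(x[0] + offset[0] / d))
    col = int(round(x[1] + offset[1] / d))
    height, width = depth.shape
    if 0 <= row < height and 0 <= col < width:
        return depth[row, col]
    return BACKGROUND_DEPTH

def depth_feature(depth, x, u, v):
    """Difference of two depth probes around pixel x, with theta = (u, v)."""
    return probe(depth, x, u) - probe(depth, x, v)

# Toy 5x5 depth image: a flat surface at 2 m with one farther pixel at 3 m.
depth = np.full((5, 5), 2.0)
depth[2, 4] = 3.0
response = depth_feature(depth, (2, 2), u=(0, 4), v=(0, 0))
off_image = depth_feature(depth, (2, 2), u=(0, 100), v=(0, 0))
```

Because the offsets are divided by the depth at x, the same θ probes a smaller pixel neighborhood when the body is farther from the camera.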

Depth image features • Individually these features provide only a weak signal • In combination in a decision forest they are sufficient to accurately disambiguate all trained parts

Depth image features • The design of these features was strongly motivated by their computational efficiency • no preprocessing is needed • read at most 3 image pixels • at most 5 arithmetic operations • straightforwardly implemented on the GPU

Randomized decision forests • Fast and effective multi-class classifiers • Implemented efficiently on the GPU
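
How a forest classifies one pixel can be sketched as follows. The slide gives no structural detail, so this assumes the standard layout: each split node holds a feature θ and a threshold τ, each leaf holds a distribution over classes, and the forest averages the leaf distributions. The dictionary-based tree encoding is hypothetical.

```python
import numpy as np

def classify_with_tree(node, feature_value):
    """Descend from the root: go left when the feature response is below the threshold."""
    while "leaf" not in node:
        response = feature_value(node["theta"])
        node = node["left"] if response < node["tau"] else node["right"]
    return np.asarray(node["leaf"], dtype=float)

def classify_with_forest(trees, feature_value):
    """Average the class distributions of the leaves reached in each tree."""
    return np.mean([classify_with_tree(tree, feature_value) for tree in trees], axis=0)

# Two toy one-split trees over two classes.
tree_a = {"theta": "a", "tau": 0.5,
          "left": {"leaf": [0.9, 0.1]}, "right": {"leaf": [0.2, 0.8]}}
tree_b = {"theta": "b", "tau": 0.0,
          "left": {"leaf": [0.6, 0.4]}, "right": {"leaf": [0.1, 0.9]}}

# Stand-in for evaluating a depth feature at one pixel (hypothetical responses).
responses = {"a": 0.3, "b": 1.0}
probs = classify_with_forest([tree_a, tree_b], lambda theta: responses[theta])
```

Every pixel follows its own path through each tree independently of the others, which is what makes a per-pixel GPU implementation natural.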

Randomized decision forests

Joint position proposals • generate reliable proposals for the positions of 3D skeletal joints • the final output of our algorithm • used by a tracking algorithm to self initialize • and recover from failure

Joint position proposals • A local mode-finding approach based on mean shift with a weighted Gaussian kernel • Density estimator per body part c: f_c(x̂) ∝ Σ_i w_ic exp(-||(x̂ - x̂_i) / b_c||²) • x̂_i is the reprojection of image pixel x_i into world space given depth d_I(x_i) • b_c is a learned per-part bandwidth
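
The reprojection of a pixel into world space can be illustrated with a pinhole camera model. The slide does not state the camera model or its parameters, so the intrinsics below (`FX`, `FY`, `CX`, `CY`) are hypothetical values chosen for a 640x480 image.

```python
import numpy as np

def reproject(u, v, depth, fx, fy, cx, cy):
    """Back-project pixel (u, v) with depth in metres to camera-space (X, Y, Z)."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.array([x, y, depth])

# Hypothetical intrinsics: focal lengths in pixels and the principal point.
FX = FY = 575.0
CX, CY = 320.0, 240.0

center = reproject(320, 240, 2.0, FX, FY, CX, CY)        # pixel at the principal point
right = reproject(320 + 575, 240, 2.0, FX, FY, CX, CY)   # one focal length to the right
```

A pixel at the principal point maps onto the optical axis, and a pixel one focal length to the side maps to a lateral offset equal to its depth.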

Non-Parametric Density Estimation • Assumption: the data points are sampled from an underlying PDF • Data point density implies PDF value • Figure: assumed underlying PDF and real data samples

Parametric Density Estimation • Assumption: the data points are sampled from an underlying PDF • Estimate the parameters of the assumed PDF from the data • Figure: assumed underlying PDF and real data samples
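
The contrast between the two estimators can be seen on bimodal data. The sketch below compares a Gaussian kernel density estimate with a single fitted Gaussian; the sample data, the bandwidth, and the function names are assumptions chosen for the illustration.

```python
import numpy as np

def gaussian_kde(samples, x, bandwidth):
    """Non-parametric estimate: average of Gaussian bumps centred on the samples."""
    z = (x - samples) / bandwidth
    return np.mean(np.exp(-0.5 * z ** 2)) / (bandwidth * np.sqrt(2 * np.pi))

def single_gaussian(samples, x):
    """Parametric estimate: fit one Gaussian (mean, standard deviation) to the samples."""
    mu, sigma = samples.mean(), samples.std()
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Two well-separated clusters of samples.
rng = np.random.default_rng(0)
samples = np.concatenate([rng.normal(-2.0, 0.3, 500), rng.normal(2.0, 0.3, 500)])

kde_at_gap = gaussian_kde(samples, 0.0, bandwidth=0.3)    # between the clusters
kde_at_mode = gaussian_kde(samples, 2.0, bandwidth=0.3)   # on one cluster
fit_at_gap = single_gaussian(samples, 0.0)
```

The kernel estimate is close to zero in the empty gap and high at each cluster, whereas the single fitted Gaussian peaks in the gap where there is no data, because the assumed form does not match the samples.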

Joint position proposals • Pixel weighting: w_ic = P(c | I, x_i) · d_I(x_i)² • w_ic considers both the inferred body part probability at the pixel and the world surface area of the pixel

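Joint position proposals • The detected modes lie on the surface of the body • Each mode is pushed back into the scene by a learned z offset to produce a final joint position proposal • Bandwidths b_c, probability thresholds λ_c, and z offsets are optimized per part by grid search on a hold-out validation set of 5000 images • Resulting mean values: bandwidth 0.065 m, probability threshold 0.14, z offset 0.039 m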
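
Putting the pieces together, a joint proposal for one body part might be computed roughly as below. This is a sketch under stated assumptions: pixels are already reprojected to 3D points, each pixel is weighted by its part probability times squared depth, the threshold is treated as the minimum part probability for a start point, only one start point is used, and the default values are the mean values quoted on the slide rather than per-part settings. The function name `joint_proposal` is hypothetical.

```python
import numpy as np

def joint_proposal(points, part_prob, bandwidth=0.065, prob_threshold=0.14,
                   z_offset=0.039, max_iter=50, tol=1e-5):
    """Weighted Gaussian mean shift over 3D points, then push the mode back in z."""
    points = np.asarray(points, dtype=float)
    part_prob = np.asarray(part_prob, dtype=float)
    if part_prob.max() <= prob_threshold:
        return None                                # no confident pixel to start from
    # Pixel weight: part probability times squared depth (world surface area of the pixel).
    weights = part_prob * points[:, 2] ** 2
    x = points[np.argmax(part_prob)].copy()        # single start point (simplification)
    for _ in range(max_iter):
        kernel = weights * np.exp(-np.sum(((points - x) / bandwidth) ** 2, axis=1))
        new_x = (kernel[:, None] * points).sum(axis=0) / kernel.sum()
        converged = np.linalg.norm(new_x - x) < tol
        x = new_x
        if converged:
            break
    x[2] += z_offset                               # move from the body surface into the body
    return x

# Toy data: confident surface points near (0.1, 0.2, 2.0) plus low-probability clutter.
rng = np.random.default_rng(0)
surface = rng.normal([0.1, 0.2, 2.0], 0.01, size=(100, 3))
clutter = rng.normal([1.0, 1.0, 3.0], 0.01, size=(20, 3))
cloud = np.vstack([surface, clutter])
probs = np.concatenate([np.full(100, 0.9), np.full(20, 0.05)])
proposal = joint_proposal(cloud, probs)
```

The proposal lands at the center of the confident surface points, shifted 0.039 m deeper into the scene. A fuller version would start from every pixel above the threshold and keep several modes per part.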

Joint position proposals

Experiments • Further results are provided in the supplementary material • Default settings: 3 trees, 20 deep, 300k training images per tree • 2000 training example pixels per image • 2000 candidate features θ • 50 candidate thresholds τ per feature

Experiments • Test data • challenging synthetic and real depth images to evaluate our approach • synthesize 5000 depth images • Real test set • 8808 frames of real depth images • 15 different subjects • 7 upper body joint positions

Experiments • Error metrics quantify both classification and joint prediction accuracy • Classification accuracy • average of the diagonal of the confusion matrix • between the ground truth part label and the most likely inferred part label • Joint prediction accuracy • generate recall-precision curves as a function of confidence threshold • quantify accuracy as average precision per joint
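
The classification metric can be sketched as below, assuming the confusion matrix is normalized per ground-truth class so that its diagonal holds per-class accuracies. The function name and the toy labels are made up for the example; the joint prediction metric (average precision) is not covered here.

```python
import numpy as np

def mean_per_class_accuracy(true_labels, pred_labels, num_classes):
    """Average of the diagonal of the row-normalized confusion matrix."""
    true_labels = np.asarray(true_labels)
    pred_labels = np.asarray(pred_labels)
    confusion = np.zeros((num_classes, num_classes))
    np.add.at(confusion, (true_labels, pred_labels), 1)   # rows: truth, columns: prediction
    row_sums = confusion.sum(axis=1)
    present = row_sums > 0                                # skip classes absent from the truth
    per_class = np.diag(confusion)[present] / row_sums[present]
    return per_class.mean()

# Toy labels: class 0 is right 3 times out of 4, class 1 is right 1 time out of 2.
truth = [0, 0, 0, 0, 1, 1]
predicted = [0, 0, 0, 1, 1, 0]
score = mean_per_class_accuracy(truth, predicted, num_classes=2)
```

Averaging per class weights every body part equally, so small parts such as hands count as much as large ones such as the torso; plain pixel accuracy on the same labels would be 4/6.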
