Pedestrians Detection and Tracking


Presentation Transcript


  1. Pedestrians Detection and Tracking Papers: Pfinder: Real-Time Tracking of the Human Body, Wren, C., Azarbayejani, A., Darrell, T., and Pentland, A. Tracking and Labelling of Interacting Multiple Targets, J. Sullivan and S. Carlsson

  2. Presentation Overview • This talk will cover two distinct tracking algorithms. • Pfinder: Real-Time Tracking of the Human Body • Multi-target tracking and labeling • For each of them we will present: • Motivation and previous approaches • Review of relevant techniques • Algorithm details • Applications and demos

  3. Pedestrians Detection and Tracking • There is always a major trade-off between genericity and accuracy. • Because we know we are trying to identify and track human beings, we can start making assumptions about our objects. • If we have more specific information (for example, tracking players in a football game), we can add even more specific assumptions. • These kinds of assumptions help us achieve more accurate tracking.

  4. Tracking Algorithm #1 Pfinder: Real-Time Tracking of the Human Body

  5. Motivation

  6. Introduction • Pfinder is a tracking algorithm that: • Detects human motion in real time • Segments the person’s body • Analyzes internal features (head, body, hands, and feet)

  7. Background • Many tracking algorithms use a static model – for each frame, similar pixels are searched for in the vicinity of the previous frame’s bounding box. • We will use a dynamic model – one that learns over time. • Most tracking algorithms need some user input for initialization. • The presented algorithm performs automatic initialization.

  8. Before we start – Some probability background… • Covariance • For a sampling domain of dimension n, we define the domain’s variables X_1, …, X_n, with means μ_i = E[X_i]. • The covariance of two variables is defined: cov(X_i, X_j) = E[(X_i − μ_i)(X_j − μ_j)]. • The covariance of two variables is a measure of how much the two variables change together.

  9. Probability background – cont’ • The covariance matrix (marked Σ) is defined: Σ_ij = cov(X_i, X_j). • The normal distribution of a variable x with mean μ and variance σ² is defined: f(x) = (1 / (σ√(2π))) exp(−(x − μ)² / (2σ²)).

  10. Probability background – cont’ • The more general multivariate normal distribution is defined: f(x) = (1 / ((2π)^(n/2) |Σ|^(1/2))) exp(−½ (x − μ)ᵀ Σ⁻¹ (x − μ)).

  11. Probability background – cont’ • Mahalanobis distance: the distance from a sample vector x to a group of samples with mean μ and covariance matrix Σ is defined: d(x) = √((x − μ)ᵀ Σ⁻¹ (x − μ)).
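To make the Mahalanobis distance concrete, here is a minimal NumPy sketch; the sample data and dimensions are invented purely for illustration:

```python
import numpy as np

# Invented sample data: 200 draws from a 3-D distribution.
rng = np.random.default_rng(0)
samples = rng.normal(loc=[1.0, 2.0, 3.0], scale=[1.0, 0.5, 2.0], size=(200, 3))

mu = samples.mean(axis=0)                 # group mean
sigma = np.cov(samples, rowvar=False)     # covariance matrix

def mahalanobis(x, mu, sigma):
    """Distance sqrt((x - mu)^T Sigma^{-1} (x - mu))."""
    d = x - mu
    return float(np.sqrt(d @ np.linalg.inv(sigma) @ d))

# The group mean itself is at distance 0.
print(mahalanobis(mu, mu, sigma))  # -> 0.0
```

Unlike Euclidean distance, this measure shrinks along directions where the samples vary a lot, which is why Pfinder uses it to decide whether a pixel plausibly belongs to a model.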

  12. Algorithm Steps • (Automatic) Initialization • Background is modeled in a few seconds of video where the person does not appear. • When the person enters the scene, he is detected and modeled. • The analysis loop • After the background and person models are initialized, each pixel in the next frame is checked against all models.

  13. Initialization • The first step of the algorithm is to build a preliminary representation of the person and the surrounding scene. • First we need to acquire a video sequence of the scene that does not contain a person, in order to model the background.

  14. Background Modeling • The algorithm assumes a mostly-static background. • However, it needs to be robust to illumination changes and able to recover from changes in the scene (e.g. a book that was moved from one place to another).

  15. Background Modeling – cont’ • The images in the video use the YUV color representation (Y = luminance component, UV = chrominance components). • There exists a transformation matrix which transforms the RGB representation to YUV. • The algorithm models the background by fitting each pixel with a Gaussian that describes the pixel’s mean and distribution.
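The slides do not say which RGB→YUV matrix is used; a common choice is the ITU-R BT.601 matrix, sketched here:

```python
import numpy as np

# ITU-R BT.601 RGB -> YUV transformation (a standard choice; the
# slides do not specify which matrix Pfinder actually used).
RGB_TO_YUV = np.array([
    [ 0.299,    0.587,    0.114  ],  # Y: luminance
    [-0.14713, -0.28886,  0.436  ],  # U: blue-difference chrominance
    [ 0.615,   -0.51499, -0.10001],  # V: red-difference chrominance
])

def rgb_to_yuv(rgb):
    """Convert an (..., 3) RGB array (values in [0, 1]) to YUV."""
    return rgb @ RGB_TO_YUV.T

# Pure white has full luminance and (near-)zero chrominance.
print(rgb_to_yuv(np.array([1.0, 1.0, 1.0])))
```

Separating luminance from chrominance matters here because illumination changes mostly move Y, leaving the UV components relatively stable.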

  16. Background Modeling – cont’ • We do this by measuring the pixel’s YUV mean and distribution over time. • A pixel has some YUV value in one frame; in the next frame it might change, so we mark its mean as μ and its covariance matrix as Σ, both measured over the (y, u, v) components.
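The per-pixel Gaussian estimation above can be sketched as follows; this is a simplified batch version (the real system updates the statistics incrementally), with a tiny synthetic video standing in for real footage:

```python
import numpy as np

def model_background(frames):
    """Per-pixel YUV mean and 3x3 covariance from a (T, H, W, 3) stack
    of person-free frames.  A sketch of the idea, not the paper's code."""
    T, H, W, _ = frames.shape
    mean = frames.mean(axis=0)                                  # (H, W, 3)
    centered = frames - mean                                    # (T, H, W, 3)
    # Covariance per pixel: average outer product of centered values.
    cov = np.einsum('thwi,thwj->hwij', centered, centered) / T  # (H, W, 3, 3)
    return mean, cov

# Tiny synthetic example: 10 frames of a 4x4 "video" with slight noise.
rng = np.random.default_rng(1)
frames = 0.5 + 0.01 * rng.normal(size=(10, 4, 4, 3))
mean, cov = model_background(frames)
print(mean.shape, cov.shape)  # (4, 4, 3) (4, 4, 3, 3)
```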

  17. Person Modeling • After the scene has been modeled, Pfinder watches for large deviations from this model. • This is done by measuring the Mahalanobis distance in color space between the new pixel’s value and the scene model’s value at the corresponding location. • If the distance is large enough, and the change is visible over a sufficient number of pixels, we begin to build a model of a person.
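The deviation test can be sketched as a per-pixel Mahalanobis threshold; the threshold value and the toy scene below are invented for illustration:

```python
import numpy as np

def foreground_mask(frame, bg_mean, bg_cov, thresh=3.0):
    """Mark pixels whose Mahalanobis distance from the per-pixel
    background Gaussian exceeds `thresh` (threshold is invented).
    frame, bg_mean: (H, W, 3); bg_cov: (H, W, 3, 3)."""
    d = frame - bg_mean
    inv = np.linalg.inv(bg_cov)                       # stacked 3x3 inverses
    dist2 = np.einsum('hwi,hwij,hwj->hw', d, inv, d)  # squared distance
    return np.sqrt(dist2) > thresh

# Synthetic scene: near-constant background, one strongly deviating pixel.
H, W = 4, 4
bg_mean = np.full((H, W, 3), 0.5)
bg_cov = np.broadcast_to(0.01 * np.eye(3), (H, W, 3, 3)).copy()
frame = bg_mean.copy()
frame[2, 2] = [0.9, 0.1, 0.9]          # the "person" pixel
mask = foreground_mask(frame, bg_mean, bg_cov)
print(mask.sum(), mask[2, 2])          # -> 1 True
```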

  18. Person Modeling – cont’ • The algorithm represents the detected person’s body parts using blobs. • A blob is a 2D representation of a Gaussian distribution of spatial statistics. • Also, a support map s_k(x, y) is built for each blob k, indicating which pixels belong to it.
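Fitting such a spatial Gaussian to a blob's support map can be sketched like this (a toy square blob stands in for a real body-part mask):

```python
import numpy as np

def fit_blob(mask):
    """Fit a 2-D spatial Gaussian (mean and covariance of pixel
    coordinates) to a binary support map -- a sketch of the blob idea."""
    ys, xs = np.nonzero(mask)
    pts = np.stack([xs, ys], axis=1).astype(float)   # (N, 2) of (x, y)
    mu = pts.mean(axis=0)
    cov = np.cov(pts, rowvar=False) if len(pts) > 1 else np.eye(2)
    return mu, cov

# A small square "blob" centred at (2, 2).
mask = np.zeros((5, 5), dtype=bool)
mask[1:4, 1:4] = True
mu, cov = fit_blob(mask)
print(mu)  # -> [2. 2.]
```

The mean gives the blob's location and the covariance its size and orientation, which is exactly the information the spatial model carries.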

  19. Person Modeling – Contour Analyzer • To initialize the blob models, Pfinder uses a 2D contour shape analysis that attempts to identify the head, hands, and feet location. • A blob is created for each identified location.

  20. Person Modeling – Class analyzer • The class analyzer finds the location of body features by using statistics of their position and color in previous frames. • Because no statistics have been gathered yet (these are the first frames in which the person appears), the algorithm uses ready-made statistical priors.

  21. Person Modeling – Class analyzer • Hand and face blobs have strong flesh-color priors (it appears that normalized skin color is constant across different skin pigmentation levels). • The other blobs are initialized to cover the clothing regions.

  22. Person Modeling – cont’ • The contour analyzer can find features in a single frame, but the results tend to be noisy. • The class analyzer produces accurate results, but it depends on the stability of the underlying models (i.e. no occlusion). • A blend of contour analysis and the class model is used to find the features in the next frame.

  23. Contour example [figure: original image and its extracted contour]

  24. Initialization – Review • After the initialization step of the algorithm, the information is divided into scene and person models. • The scene (background) model consists of the color-space distribution of each pixel. • The person model consists of spatial and color-space distributions for each blob. • The spatial distribution determines the blob’s location and size. • The color distribution determines the distribution of color in the blob.

  25. The Analysis Loop Given a person model and a scene model, we can now acquire a new image, interpret it, and update the scene and person models.

  26. The Analysis Loop – cont’ • Update the spatial model associated with each blob using the blob’s measured statistics, to yield the blob’s predicted spatial distribution for the current image. This is done with a Kalman filter assuming simple Newtonian dynamics.

  27. Kalman Filtering • Measurements taken from a video sequence can sometimes be very inaccurate.

  28. Kalman Filtering – cont’ • Without some kind of filtering it would be impossible to make any short-term forward predictions. • Also, each measurement is used as a seed for the tracking algorithm at the next frame. • Some kind of filtering is needed to make the measurements more accurate.

  29. Kalman Filtering – cont’ • Each tracked object is represented with a state vector (usually location) • With each new frame, a linear operator is applied to the state to generate the new state, with some noise mixed in, and some information from the controls on the system • Usually, Newton’s laws are applied.

  30. Kalman Filtering – cont’ • The noise added is a Gaussian noise with mean 0 and a covariance matrix. • The predicted state is then updated with the real measurement to create the estimate for the next frame.
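The predict/update cycle described above can be sketched as a minimal 1-D constant-velocity Kalman filter; all noise levels and measurements below are invented, since the slides do not list the paper's parameters:

```python
import numpy as np

F = np.array([[1.0, 1.0],   # state transition: x' = x + v (Newtonian)
              [0.0, 1.0]])
H = np.array([[1.0, 0.0]])  # we only measure position
Q = 0.01 * np.eye(2)        # process noise covariance (invented)
R = np.array([[0.5]])       # measurement noise covariance (invented)

def kalman_step(x, P, z):
    """One predict/update cycle: x is the state [pos, vel], P its
    covariance, z the new position measurement."""
    # Predict: apply dynamics, inflate uncertainty by process noise.
    x = F @ x
    P = F @ P @ F.T + Q
    # Update: blend prediction with the measurement.
    S = H @ P @ H.T + R                    # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)         # Kalman gain
    x = x + (K @ (z - H @ x)).ravel()
    P = (np.eye(2) - K @ H) @ P
    return x, P

x, P = np.array([0.0, 0.0]), np.eye(2)
for z in [1.0, 2.1, 2.9, 4.0]:             # noisy positions, slope ~1
    x, P = kalman_step(x, P, np.array([z]))
print(x)  # position estimate close to the last measurement, velocity ~1
```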

  31. The Analysis Loop – cont’ • Now when a new image is acquired, we measure the likelihood of each pixel being a member of each of the blob models and the scene model. • The vector y is defined as the spatial location and color of each pixel. For each class k with mean μ_k and covariance Σ_k, the log likelihood is measured: d_k(y) = −½ (y − μ_k)ᵀ Σ_k⁻¹ (y − μ_k) − ½ ln|Σ_k| − (m/2) ln 2π.

  32. The Analysis Loop – cont’ • Each pixel is now assigned to a particular class: either one of the blobs or the background. • A support map is built which indicates which pixels belong to which class: s(x, y) = argmax_k d_k(x, y).
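The per-pixel classification step can be sketched as follows; the two toy classes (a dark "background" and a bright "blob") and their parameters are invented, and the feature vector here is color only for brevity:

```python
import numpy as np

def log_likelihoods(y, mus, sigmas):
    """Log likelihood of each pixel's feature vector under each class
    Gaussian.  y: (H, W, D); mus: (C, D); sigmas: (C, D, D)."""
    H, W, D = y.shape
    out = np.empty((len(mus), H, W))
    for k in range(len(mus)):
        d = y - mus[k]
        inv = np.linalg.inv(sigmas[k])
        maha = np.einsum('hwi,ij,hwj->hw', d, inv, d)
        out[k] = (-0.5 * maha
                  - 0.5 * np.log(np.linalg.det(sigmas[k]))
                  - 0.5 * D * np.log(2 * np.pi))
    return out

# Two toy classes: "background" near 0.2, "blob" near 0.8.
mus = np.array([[0.2, 0.2, 0.2], [0.8, 0.8, 0.8]])
sigmas = np.stack([0.05 * np.eye(3)] * 2)
y = np.full((2, 2, 3), 0.2)
y[0, 0] = 0.8                                     # one "blob" pixel
support = np.argmax(log_likelihoods(y, mus, sigmas), axis=0)
print(support)  # pixel (0, 0) -> class 1, all others -> class 0
```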

  33. The Analysis Loop – cont’ • Connectivity constraints are enforced by iterative morphological growing from a single central point, to produce a connected region. • First, a foreground region comprising all the blob classes is grown. • Then, each of the individual blobs is grown with the constraint that they remain confined to the foreground region.
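A simplified sketch of constrained morphological growing, using 4-connected dilation in plain NumPy (the L-shaped toy mask is invented for illustration):

```python
import numpy as np

def grow_region(seed, mask):
    """Iteratively dilate `seed` (4-connected) while staying inside
    `mask` -- a simplified sketch of the constrained-growing idea."""
    region = seed & mask
    while True:
        grown = region.copy()
        grown[1:, :] |= region[:-1, :]   # dilate down
        grown[:-1, :] |= region[1:, :]   # dilate up
        grown[:, 1:] |= region[:, :-1]   # dilate right
        grown[:, :-1] |= region[:, 1:]   # dilate left
        grown &= mask                    # stay confined to the mask
        if np.array_equal(grown, region):
            return region                # converged
        region = grown

# An L-shaped foreground (7 pixels) with a seed in one corner.
mask = np.zeros((4, 4), dtype=bool)
mask[0, :] = True
mask[:, 0] = True
seed = np.zeros_like(mask)
seed[3, 0] = True
region = grow_region(seed, mask)
print(region.sum())  # -> 7 (the whole connected L is reached)
```

Growing until convergence yields exactly the connected component of the mask containing the seed, which is the connectivity constraint the slide describes.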

  34. The Analysis Loop – cont’ • Now the statistical model for each class is updated. • For the blob classes, the new mean is calculated. • The Kalman filter statistics are also updated at this time. • Background pixels are also updated, to give the system the ability to recover from changes in the scene.

  35. Limitations • The algorithm employs several domain-specific assumptions in order to achieve accurate tracking. • If one of the assumptions breaks, the system degrades. • However, the system can recover after a few frames once the assumptions hold again. • The system can only track a single person.

  36. Performance • RMS (Root Mean Square) errors were found to be on the order of a few pixels.

  37. Applications • A Modular Interface – an application that provides programmers with tracking, segmentation and feature detection. • The ALIVE application places 3D animated characters that interact with the person according to his gestures. Here, Rexy!

  38. Applications – cont’ • The SURVIVE application used the recorded movement of the person to navigate a 3D virtual game environment. I guess you can’t get any nerdier than this.

  39. Applications – cont’ • Recognition of American Sign Language • Pfinder was used as a pre-processor for detecting a 40-word subset of ASL, with 99% sign accuracy.

  40. Applications – cont’ • Avatars and Telepresence • The model of the person is translated into several blobs, which can be used to model 2D characters.

  41. Tracking Algorithm #2 Multi-Target Tracking and Labeling Uses slides by Josephine Sullivan from http://www.csc.kth.se/~sullivan/

  42. Motivation

  43. Introduction • The multi-target tracking and labeling algorithm: • Tracks multiple targets over long periods of time • Recovers robustly from collisions • Labels targets even when they are interacting

  44. Multi-Target Tracking and Labeling [figures: sometimes easy, sometimes hard]

  45. Introduction • The algorithm addresses the problem of surveillance and tracking of multiple persons over a wide area. • Previous multi-target tracking algorithms are based on Kalman filtering and advanced particle-filtering techniques. • Tracking algorithms often fail when occlusion or interaction between the targets occurs.

  46. Introduction – cont’ • This work’s specific goal is to track and label the players in a football game. • This is especially hard when players collide and interact.

  47. Camera Setup • The researchers used a wide-screen video produced by combining the video from four calibrated cameras. • The images were stitched after the homographies between them were computed. • This produces a high-resolution video which gives good tracking results.

  48. Algorithm Steps • Background modeling and subtraction • Build an interaction graph • Resolve split/merge situations • Recover identities of temporally separated player trajectories.

  49. Background modeling • A probabilistic model of the image gradient of each pixel in the background is obtained. • The gradient is used to prevent situations in which a player’s uniform has the same color as the background.
