An Introduction to Action Recognition/Detection Sami Benzaid November 17, 2009
What is action recognition? • The process of identifying actions that occur in video sequences (in this case, by humans). Example Videos Videos from http://www.nada.kth.se/cvap/actions/
Why perform action recognition? • Surveillance footage • User-interfaces • Automatic video organization / tagging • Search-by-video?
Complications • Different scales • People may appear at different scales in different videos, yet perform the same action. • Movement of the camera • The camera may be a handheld camera, and the person holding it can cause it to shake. • Camera may be mounted on something that moves.
Complications, continued • Movement with the camera • The subject performing an action (i.e., skating) may be moving with the camera at a similar speed. Figure from Niebles et al.
Complications, continued • Occlusions • Action may not be fully visible Figure from Ke et al.
Complications, continued • Background “clutter” • Other objects/humans present in the video frame. • Human variation • Humans are of different sizes/shapes • Action variation • Different people perform different actions in different ways. • Etc…
Why have I chosen this topic? • I wanted an opportunity to learn about it. (I knew nothing of it beforehand). • I will likely be incorporating it into my research in the future, but I'm not there yet.
Why have I chosen this topic? • Specifically: • Want to know if someone is on the phone • hence not interruptible; can set status to “busy” • Want to know if someone is present (look for characteristic human actions) • Things other than humans can cause motion, this will make for a more reliable presence detector. • Online status can change accordingly. • Want to know immediately when someone leaves • Can accurately set status to unavailable
1st Paper Overview • Recognizing Human Actions: A Local SVM Approach (2004) • Use local space-time features to represent video sequences that contain actions. • Classification is done via an SVM. Results are also computed for KNN for comparison. • Christian Schuldt, Ivan Laptev and Barbara Caputo (2004)
The Dataset • Video dataset with a few thousand instances. • 25 people each: • perform 6 different actions • Walking, jogging, running, boxing, hand waving, clapping • in 4 different scenarios • Outdoors, outdoors w/scale variation, outdoors w/different clothes, indoors • (several times) • Backgrounds are mostly free of clutter. • Only one person performing a single action per video.
The Dataset Figure from Schuldt et al.
Local Space-time features Figure from Schuldt et al.
Representation of Features • Spatial-temporal “jets” (4th order) are computed at each feature center: • Using k-means clustering, a vocabulary consisting of words hi is created from the jet descriptors. • Finally, a given video is represented by a histogram of counts of occurrences of features corresponding to hi in that video:
Recognition Methods • 2 representations of data: •  Raw jet descriptors (LF) (“local feature” kernel) • Wallraven, Caputo, and Graf (2003) •  Histograms (HistLF) (X2 kernel) • 2 classification methods: • SVM • K Nearest Neighbor
Results Figure from Schuldt et al.
Results • Some categories can be confused with others (running vs. jogging vs. walking / waving vs. boxing) due to different ways that people perform these tasks. • Local Features (raw jet descriptors without histograms) combined with SVMs was the best-performing technique for all tested scenarios.
A Potentially Prohibitive Cost • The method we just saw was entirely supervised. All videos used in the training set had labels attached to them. • Sometimes, labeling videos can be a very expensive operation. • What if we want to be able to train on a training set in which the videos are not necessarily labeled, and recognize the occurrences of their actions in a test set?
2nd Paper Overview • Unsupervised Learning of Human Action Categories Using Spatial-Temporal Words (2008) • Unsupervised approach to classifying actions that occur in video. • Uses pLSA (Probabilistic Latent Semantic Analysis) to learn a model. • Juan Carlos Niebles, Hongcheng Wang and Li Fei-Fei (2008)
Data • Training set: • Set of videos in which a single person is performing a single action. • Videos are unlabeled. • Test set (relaxed requirement): • Set of videos which can contain multiple people performing multiple actions simultaneously.
Space-time Interest Points Figure from Niebles et al.
Space-time Interest Points Figure from Niebles et al.
Feature Descriptors • Brightness gradients calculated for each interest point ‘cube’. • Image gradients computed for each cube (at different scales of smoothing). • Gradients concatenated to form feature vector. • Length of vector: [# of pixels in interest “point”] * [# of smoothing scales] * [# of gradient directions] • Vector is projected to lower dimensions using PCA.
Codebook Formation • Codebook of spatial-temporal words. • K-means clustering of all space-time interest points in training set. • Clustering metric: Euclidean distance • Videos represented as collections of spatial-temporal words.
Representation • Space-time interest points: Figure from Niebles et al.
pLSA: Learning a Model • Probabilistic Latent Semantic Analysis • Generative Model • Variables: • wi = spatial-temporal word • dj = video • n(wi, dj) = co-occurrence table (# of occurrences of word wi in video dj) • z = topic, corresponding to an action
z d w Probabilistic Latent Semantic Analysis • Unsupervised technique • Two-level generative model: a video is a mixture of topics, and each topic has its own characteristic “word” distribution video topic word P(z|d) P(w|z) Slide: Lana Lazebnik T. Hofmann, Probabilistic Latent Semantic Analysis, UAI 1999
z d w Probabilistic Latent Semantic Analysis • Unsupervised technique • Two-level generative model: a video is a mixture of topics, and each topic has its own characteristic “word” distribution Slide: Lana Lazebnik T. Hofmann, Probabilistic Latent Semantic Analysis, UAI 1999
The pLSA model Probability of word i giventopic k (unknown) Probability oftopic k givenvideo j(unknown) Probability of word i in video j(known) Slide: Lana Lazebnik
The pLSA model videos videos topics p(wi|dj) p(wi|zk) p(zk|dj) words words topics = Observed codeword distributions (M×N) Codeword distributions per topic (class) (M×K) Class distributions per video (K×N) Slide: Lana Lazebnik
Learning pLSA parameters Maximize likelihood of data: Observed counts of word i in video j M … number of codewords N … number of videos Slide credit: Josef Sivic
Inference • Finding the most likely topic (class) for a video: Slide: Lana Lazebnik
Inference • Finding the most likely topic (class) for a video: • Finding the most likely topic (class) for a visual word in a given video: Slide: Lana Lazebnik
Datasets • KTH human motion dataset (Schuldt et al. 2004) • (Dataset introduced in previous paper.) • Weizmann human action dataset (Blank et al. 2005) • Figure skating dataset (Wang et al. 2006)
Example of Testing (KTH) Figure from Niebles et al.
KTH Dataset Results Figure from Niebles et al.
Results Compared to Prev. Paper Figures from Niebles et al., Schuldt et al.
Results Compared to Prev. Paper First Paper (Supervised Technique) This Paper (Unsupervised Technique)
Weizmann Dataset • 10 action categories, 9 different people performing each category. • 90 videos. • Static camera, simple background.
Weizmann Examples Figure from Niebles et al.
Example of Testing (Weizmann) Figure from Niebles et al.
Weizmann Dataset Results Figure from Niebles et al.
Figure Skating Dataset • 32 video sequences • 7 people • 3 actions: • Stand-spin • Camel-spin • Sit-spin • Camera motion, background clutter, viewpoint changes.
Example Frames Figure from Niebles et al.
Example of Testing Figure from Niebles et al.
Figure Skating Results Figure from Niebles et al.
A Distinction • Action Recognition vs. Event Detection • Action Recognition • = Classify a video of an actor performing a certain class of action. • Event Detection • = Detect instance(s) of event(s) from predefined classes, occurring in video that more closely resembles real-life events (i.e., cluttered background, multiple humans, occlusions).
3rd Paper Overview • Event Detection in Cluttered Videos (2007) • Represent actions as spatial-temporal volumes. • Detection is done via a distance threshold between a template action volume and a video sequence. • Y. Ke, R. Sukthankar, and M. Hebert (2007)
Example images Cluttered background + occlusion: hand-waving Cluttered background: picking something up Images from Ke et al.
Representation of an event • Spatio-temporal volume: Example: hand-waving Image from Ke et al.