[Human-Computer Interaction: From Theory to Applications] Final Report (Paper Study)
A survey on vision-based human action recognition, Image and Vision Computing 28 (2010) 976–990
Student: Husan-Pei Wang (王瑄珮)
Student ID: P96994020
Instructor: Jenn-Jier Lien, Ph.D.
Outline
• 1. Introduction
  • 1.1 Challenges and characteristics of the domain
  • 1.2 Common datasets
• 2. Image representation
  • 2.1 Global representations
    • 2.1.1 Space–time volumes
  • 2.2 Local representations
    • 2.2.1 Local descriptors
    • 2.2.2 Correlations between local descriptors
  • 2.3 Application-specific representations
• 3. Action classification
  • 3.1 Direct classification
  • 3.2 Temporal state-space models
  • 3.3 Action detection
• 4. Discussion
• 5. References
1. Introduction (1/2)
• This paper considers the task of labeling videos containing human motion with action classes.
• The task is challenging because of:
  • Variations in motion performance
  • Differences in recording settings
  • Inter-personal differences
• The paper provides a detailed overview of current advances in the field that address these challenges.
1. Introduction (2/2)
• The recognition of movement can be performed at various levels of abstraction.
• This paper adopts the hierarchy used by Moeslund et al. [1]:
  • Action primitive: a single atomic movement
    • e.g., moving the left leg forward
  • Action: a sustained, single movement
    • e.g., running
  • Activity: composed of several actions
    • e.g., a hurdle race consists of running and jumping
1.2 Common datasets
• Widely used sets:
  • KTH human motion dataset: 6 actions, 25 actors, 4 scenarios, relatively static background
  • Weizmann human action dataset: 10 actions, 10 actors, static background, foreground silhouettes included
  • INRIA XMAS multi-view dataset: 14 actions, 11 actors, 5 viewpoints, fixed camera views, static background and illumination, silhouettes and volumetric voxel data included
  • UCF sports action dataset: 150 sequences of sport motions, with considerable variation in human appearance, camera movement, viewpoint, illumination and background
  • Hollywood human action dataset: 8 actions, no limit on actors, huge variety in action performance, occlusions, camera movements and dynamic backgrounds
2. Image representation
• This section discusses the features that are extracted from the image sequences.
• The paper divides image representations into two categories:
  • Global representations: obtained in a top-down fashion
  • Local representations: obtained in a bottom-up fashion
2.1 Global representations
• Global representations encode the region of interest (ROI) of a person as a whole.
• The ROI is usually obtained through background subtraction or tracking.
• Global representations are sensitive to noise, partial occlusions and variations in viewpoint.
• To partly overcome these issues:
  • Grid-based approaches spatially divide the observation into cells, each of which encodes part of the observation locally (see the sketch after this list).
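As an illustration of the grid-based idea, here is a minimal numpy sketch (the function name and the 4×4 cell layout are ours, not from the survey): each cell encodes its part of the silhouette locally, here simply as foreground occupancy.

```python
import numpy as np

def grid_descriptor(silhouette, rows=4, cols=4):
    """Hypothetical sketch of a grid-based global representation: split a
    binary silhouette (H x W) into rows x cols cells and report the fraction
    of foreground pixels in each cell."""
    h, w = silhouette.shape
    cells = []
    for i in range(rows):
        for j in range(cols):
            cell = silhouette[i * h // rows:(i + 1) * h // rows,
                              j * w // cols:(j + 1) * w // cols]
            cells.append(cell.mean())        # local occupancy of this cell
    return np.asarray(cells)                 # feature vector of length rows*cols

# Example: a toy 64x48 silhouette with a filled rectangle as "foreground".
mask = np.zeros((64, 48))
mask[16:48, 12:36] = 1.0
print(grid_descriptor(mask).round(2))
```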
2.1.1 Space–time volumes
• A 3D spatio-temporal volume (STV) is formed by stacking frames over a given sequence.
• Requirements: accurate localization, alignment and possibly background subtraction.
• Blank et al. [2,3] first stack silhouettes over a given sequence to form an STV, as in the sketch below.
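In code, forming an STV is just stacking aligned masks along a time axis; a minimal numpy sketch with placeholder silhouettes:

```python
import numpy as np

# Minimal sketch: stack per-frame silhouettes into a space-time volume.
# 'silhouettes' is a stand-in for T aligned binary foreground masks (H x W).
silhouettes = [np.zeros((64, 48), dtype=np.uint8) for _ in range(30)]

stv = np.stack(silhouettes, axis=2)   # shape (64, 48, 30): x, y and time
print(stv.shape)
```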
2.2 Local representations (1/2)
• Local representations describe the observation as a collection of local descriptors or patches.
• They are somewhat invariant to changes in viewpoint, person appearance and partial occlusions.
• Space–time interest points are the locations in space and time where sudden changes of movement occur in the video.
• Laptev and Lindeberg [4] extended the Harris corner detector [5] to 3D: space–time interest points are those points where the local neighborhood has a significant variation in both the spatial and the temporal domain (see the sketch below).
• The work is extended to compensate for relative camera motions in [6].
• Drawback: the relatively small number of stable interest points.
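A rough numpy/scipy sketch of the 3D Harris response (smoothing scales and the constant k are illustrative; the actual detector in [4] also searches over spatial and temporal scales):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def harris3d_response(video, sigma=2.0, tau=1.5, k=0.005):
    """Space-time Harris response over a (T, H, W) volume: a sketch of the
    3D extension in [4]; scales sigma/tau and constant k are illustrative."""
    v = gaussian_filter(video.astype(float), (tau, sigma, sigma))
    It, Iy, Ix = np.gradient(v)                   # temporal and spatial gradients
    s = lambda a: gaussian_filter(a, (2 * tau, 2 * sigma, 2 * sigma))
    m = [[s(Ix * Ix), s(Ix * Iy), s(Ix * It)],    # smoothed second-moment matrix
         [s(Ix * Iy), s(Iy * Iy), s(Iy * It)],
         [s(Ix * It), s(Iy * It), s(It * It)]]
    M = np.stack([np.stack(row, axis=-1) for row in m], axis=-2)  # (..., 3, 3)
    trace = M[..., 0, 0] + M[..., 1, 1] + M[..., 2, 2]
    return np.linalg.det(M) - k * trace**3        # high at space-time corners

resp = harris3d_response(np.random.rand(16, 32, 32))   # toy random video
```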
2.2 Local representations (2/2)
• Improvement: Dollár et al. [7] apply Gabor filtering on the spatial and temporal dimensions individually (see the sketch below).
• The number of interest points is adjusted by changing the spatial and temporal size of the neighborhood in which local maxima of the response are selected.
• Instead of detecting interest points over the entire volume:
  • Wong and Cipolla [8] first detect subspaces of correlated movement. These subspaces correspond to large movements such as an arm wave.
  • Within these subspaces, a sparse set of interest points is detected.
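One common formulation of this detector applies spatial Gaussian smoothing followed by a quadrature pair of temporal Gabor filters, R = (I∗g∗h_ev)² + (I∗g∗h_od)²; a sketch with illustrative parameter values:

```python
import numpy as np
from scipy.ndimage import gaussian_filter, convolve1d

def cuboid_response(video, sigma=2.0, tau=3.0):
    """Periodic-motion response in the spirit of Dollar et al. [7]: spatial
    Gaussian smoothing plus a quadrature pair of temporal Gabor filters.
    video has shape (T, H, W); sigma and tau are illustrative."""
    smoothed = gaussian_filter(video.astype(float), (0, sigma, sigma))
    t = np.arange(-2 * int(tau), 2 * int(tau) + 1)
    w = 4.0 / tau                                  # temporal frequency
    even = -np.cos(2 * np.pi * t * w) * np.exp(-t**2 / tau**2)
    odd = -np.sin(2 * np.pi * t * w) * np.exp(-t**2 / tau**2)
    r = convolve1d(smoothed, even, axis=0)**2 + convolve1d(smoothed, odd, axis=0)**2
    return r                                       # interest points: local maxima

resp = cuboid_response(np.random.rand(32, 24, 24))     # toy random video
```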
2.2.1 Local descriptors (1/2)
• Local descriptors summarize an image or video patch in a representation that is ideally invariant to background clutter, appearance and occlusions, and possibly to rotation and scale.
• The spatial and temporal size of a patch is usually determined by the scale of the interest point.
[Figure: extraction of space–time cuboids at interest points from similar actions performed by different persons [6]]
2.2.1 Local descriptors (2/2)
• Challenge: the varying number and usually high dimensionality of the descriptors make it hard to compare sets of local descriptors.
• A common solution (see the sketch after this list):
  • A codebook is generated by clustering patches and selecting either cluster centers or the closest patches as codewords.
  • Each local descriptor is then assigned to a codeword.
  • A frame or sequence can be represented as a bag-of-words: a histogram of codeword frequencies.
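A minimal scikit-learn sketch of codebook construction and bag-of-words encoding (the codebook size and the random descriptors are placeholders):

```python
import numpy as np
from sklearn.cluster import KMeans

# Placeholder local descriptors pooled from training videos: (N, D).
rng = np.random.default_rng(0)
train_descriptors = rng.standard_normal((500, 64))

k = 100                                        # codebook size (illustrative)
codebook = KMeans(n_clusters=k, n_init=4, random_state=0).fit(train_descriptors)

# Encode one sequence: assign each descriptor to its nearest codeword,
# then histogram the codeword frequencies into a bag-of-words vector.
seq_descriptors = rng.standard_normal((40, 64))
words = codebook.predict(seq_descriptors)
bow = np.bincount(words, minlength=k) / len(words)
```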
2.2.2 Correlations between local descriptors
• This section describes approaches that exploit correlations between local descriptors, either for selection or for the construction of higher-level descriptors.
• Scovanner et al. [11] construct a word co-occurrence matrix and iteratively merge words with similar co-occurrences until the difference between all pairs of words is above a specified threshold (see the sketch after this list).
  • This leads to a reduced codebook size, and similar actions are likely to generate more similar distributions of codewords.
• Correlations between descriptors can also be obtained by tracking features.
  • Sun et al. [12] calculate SIFT (scale-invariant feature transform) descriptors around interest points in each frame and use Markov chaining to determine tracks of these features.
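A greedy numpy sketch of the merging idea (the Euclidean similarity measure and the merge rule of summing rows and columns are our assumptions, not details from [11]):

```python
import numpy as np

def merge_similar_words(cooc, threshold):
    """Greedy sketch of codeword merging: repeatedly merge the pair of words
    whose co-occurrence profiles are most similar, until all remaining pairs
    differ by more than 'threshold'."""
    groups = [[i] for i in range(len(cooc))]   # original words behind each row
    c = np.asarray(cooc, dtype=float)
    while len(c) > 1:
        # Pairwise Euclidean distance between co-occurrence rows.
        d = np.linalg.norm(c[:, None, :] - c[None, :, :], axis=-1)
        np.fill_diagonal(d, np.inf)
        i, j = np.unravel_index(np.argmin(d), d.shape)
        if d[i, j] > threshold:
            break                              # every pair is dissimilar enough
        if i > j:
            i, j = j, i                        # keep the lower index
        groups[i] += groups.pop(j)             # merge word j into word i
        c[i] += c[j]
        c = np.delete(c, j, axis=0)
        c[:, i] += c[:, j]
        c = np.delete(c, j, axis=1)
    return groups

print(merge_similar_words(np.eye(4), threshold=1.5))
```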
2.3 Application-specific representations
• This section discusses works that use representations directly motivated by the domain of human action recognition.
• Smith et al. [13] use a number of specifically selected features:
  • Low-level features deal with color and movement.
  • Higher-level features are obtained from head and hand regions.
  • A boosting scheme takes into account the history of the action performance.
• Vitaladevuni et al. [14] are inspired by the observation that human actions differ in accelerating and decelerating force.
  • They identify reach, yank and throw movement types.
  • Temporal segmentation into atomic movements, described by movement type, spatial location and direction of movement, is performed first.
3. Action classification
• When an image representation is available for an observed frame or sequence, human action recognition becomes a classification problem.
• Approaches are discussed in three groups:
  • Direct classification
  • Temporal state-space models
  • Action detection
3.1 Direct classification (1/2)
• These approaches pay no special attention to the temporal domain.
• They summarize all frames of an observed sequence into a single representation, or perform action recognition for each frame individually.
• Dimensionality reduction (see the sketch after this list):
  • Dimensionality reduction analyzes the data to find an embedding that maps it from the original high-dimensional space to a low-dimensional one.
  • It lowers computational complexity.
  • It yields a more intrinsically meaningful representation of the data.
  • It makes high-dimensional data easier to visualize.
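As a concrete example of dimensionality reduction, a minimal PCA sketch with scikit-learn (the frame features and the target dimensionality are placeholders):

```python
import numpy as np
from sklearn.decomposition import PCA

# Placeholder: each row is a high-dimensional representation of one frame.
rng = np.random.default_rng(0)
frame_features = rng.standard_normal((200, 1024))

pca = PCA(n_components=32)                    # target dimensionality (illustrative)
low_dim = pca.fit_transform(frame_features)   # (200, 32): cheaper to compare
print(low_dim.shape)
```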
3.1 Direct classification (2/2)
• Nearest neighbor classification (see the sketch after this list)
  • To classify an unknown sample, find the nearest sample with a known class label and assign that label to the unknown sample.
  • Pros: simple, with reasonable accuracy.
  • Cons: computation time and memory requirements grow with the number of prototype points or feature variables.
• Discriminative classifiers
  • They focus on separating the data into two or more classes, rather than modeling each class.
  • The end result is one large overall classification, even though each individual class is small.
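A minimal nearest-neighbor sketch with scikit-learn, on toy data standing in for per-sequence action features:

```python
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

# Toy stand-in for per-sequence action features with 3 action classes.
X, y = make_classification(n_samples=120, n_features=32, n_classes=3,
                           n_informative=8, random_state=0)

knn = KNeighborsClassifier(n_neighbors=1)     # the nearest-neighbor rule
knn.fit(X[:100], y[:100])                     # prototype samples with known labels
print(knn.score(X[100:], y[100:]))            # accuracy on held-out samples
```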
3.2 Temporal state-space models (1/6)
• State-space models consist of states connected by edges.
  • These edges model probabilities between states, and between states and observations.
• Model:
  • State: an action performance (one state per action performance)
  • Observation: the image representation at a given time
• Dynamic time warping (DTW) (see the sketch after this list)
  • DTW computes the accumulated Euclidean distance between an input feature sequence and the reference sequences in a database (e.g., between pitch vectors in speech recognition).
  • It takes longer to compute, but achieves a higher recognition rate.
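A sketch of the classic dynamic-programming recurrence behind DTW, using 1-D sequences and an absolute-difference local cost for simplicity:

```python
import numpy as np

def dtw_distance(a, b):
    """Accumulated DTW alignment cost between two 1-D sequences,
    using absolute difference as the local distance."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j],       # insertion
                                 D[i, j - 1],       # deletion
                                 D[i - 1, j - 1])   # match
    return D[n, m]

# Identical shapes performed at slightly different speeds align cheaply.
print(dtw_distance([0, 1, 2, 3, 2, 1], [0, 1, 1, 2, 3, 2, 1]))
```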
3.2 Temporal state-space models (2/6)
• Generative models
• Hidden Markov models (HMM):
  • Statistically build a (dynamic) probability model for each class.
  • Particularly suitable for input sequences of variable length.
  • The number of states is not known in advance and has to be chosen based on experience.
• Three components (see the sketch after this list):
  • Observation probabilities: the probability that an observation is emitted from a given hidden state.
  • Transition probabilities: the probabilities of transitions between hidden states.
  • Initial probabilities: the probability of starting in a given hidden state.
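These three components are enough to evaluate the likelihood of a sequence under an HMM; a minimal sketch of the forward algorithm with a toy two-state model (all numbers illustrative):

```python
import numpy as np

def forward_likelihood(obs, pi, A, B):
    """Forward algorithm: likelihood of a discrete observation sequence under
    an HMM with initial probs pi (S,), transitions A (S, S), emissions B (S, O)."""
    alpha = pi * B[:, obs[0]]                 # initial * observation probabilities
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]         # transition, then emit
    return alpha.sum()

# Toy two-state, two-symbol model.
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])
B = np.array([[0.9, 0.1],
              [0.2, 0.8]])
print(forward_likelihood([0, 1, 0], pi, A, B))
```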
3.2 Temporal state-space models (3/6)
• Applications of generative models
  • Feng and Perona [15] use a static HMM where key poses correspond to states.
  • Weinland et al. [16] construct a codebook by discriminatively selecting templates. In the HMM, they condition the observation on the viewpoint.
  • Lv and Nevatia [17] use an Action Net, which is constructed by considering key poses and viewpoints. Transitions between views and poses are encoded explicitly.
  • Ahmad and Lee [18] take multiple viewpoints into account and use a multi-dimensional HMM to deal with the different observations.
3.2 Temporal state-space models (4/6)
• Generative models
  • Instead of modeling the human body as a single observation, one HMM can be used for each body part.
  • This makes training easier because:
    • The combinatorial complexity is reduced to learning a dynamical model for each limb individually.
    • Composite movements that are not in the training set can still be recognized.
3.2 Temporal state-space models (5/6)
• Discriminative models
  • They are trained to maximize the quality of the output over a training set.
  • HMMs assume that observations in time are independent, which is often not the case.
  • Discriminative models overcome this issue by modeling a conditional distribution over action labels given the observations.
  • Discriminative models are suitable for classifying related actions.
  • Discriminative graphical models require many training sequences to robustly determine all parameters.
3.2 Temporal state-space models (6/6)
• Discriminative models
  • Conditional random fields (CRF) are discriminative models that can use multiple overlapping features.
  • CRFs combine the advantages of finite-state HMMs and SVM-style techniques, such as handling dependent features and taking the complete sequence into consideration.
  • Variants of CRFs have also been proposed.
    • Shi et al. [19] use a semi-Markov model (SMM), which is suitable for both action segmentation and action recognition.
3.3 Action detection
• Some works assume motion periodicity, which allows for temporal segmentation by analyzing the self-similarity matrix (see the sketch after this list).
• Seitz and Dyer [20] introduce a periodicity detection algorithm that is able to cope with small variations in the temporal extent of a motion.
• Cutler and Davis [21] perform a frequency transform on the self-similarity matrix of a tracked object.
  • Peaks in the spectrum correspond to the frequency of the motion.
  • The type of action is determined by analyzing the matrix structure.
• Polana and Nelson [22] use Fourier transforms to find the periodicity and temporally segment the video.
  • They match motion features to labeled 2D motion templates.
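A simplified sketch of this pipeline: build a self-similarity matrix over per-frame features and read the dominant frequency from the Fourier spectrum of its mean profile (this glosses over the lattice-fitting details of [21]):

```python
import numpy as np

def dominant_motion_frequency(features, fps=25.0):
    """Sketch of periodicity detection via a self-similarity matrix:
    'features' is a (T, D) array of per-frame features. Build the matrix of
    pairwise distances, Fourier-transform its mean profile, and return the
    dominant frequency in Hz."""
    d = np.linalg.norm(features[:, None, :] - features[None, :, :], axis=-1)
    profile = d.mean(axis=0) - d.mean()          # zero-mean similarity profile
    spectrum = np.abs(np.fft.rfft(profile))
    freqs = np.fft.rfftfreq(len(profile), d=1.0 / fps)
    return freqs[1:][np.argmax(spectrum[1:])]    # skip the DC component

# Toy repetitive motion: a sawtooth feature repeating at 2 Hz, sampled at 25 fps.
t = np.arange(100) / 25.0
gait = (2.0 * t) % 1.0
print(dominant_motion_frequency(gait[:, None]))  # ~2.0
```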
4. Discussion (1/5)
• Image representation
• Global image representations
  • Pros:
    • Good results.
    • They can usually be extracted at low cost.
  • Cons:
    • Limited to scenarios where ROIs can be determined reliably.
    • Cannot deal with occlusions.
• Local representations
  • Take into account spatial and temporal correlations between patches.
  • Occlusions have largely been ignored.
4. Discussion (2/5)
• About viewpoints
  • Most of the reported work is restricted to fixed viewpoints.
  • Multiple view-dependent action models solve this issue, but at the cost of increased training complexity.
• About classification
  • In direct classification, temporal variations are not explicitly modeled, which proved to be a reasonable approach in many cases; for more complex motions, however, it is questionable whether this approach is suitable.
  • Generative state-space models such as HMMs can model temporal variations, but have difficulties distinguishing between related actions.
  • Discriminative graphical approaches are more suitable for that task.
4. Discussion (3/5)
• About action detection
  • Many approaches assume that:
    • The video is readily segmented into sequences, each containing one instance of a known set of action labels.
    • The location and approximate scale of the person in the video is known or can easily be estimated.
  • Thus the action detection task is ignored, which limits applicability to situations where segmentation in space and time is possible.
  • It remains a challenge to perform action detection for online applications.
4. Discussion (4/5)
• The HOHA dataset [23] targets action recognition in movies, whereas the UCF sports dataset [24] contains sport footage.
• The use of application-specific datasets allows for evaluation metrics that go beyond precision and recall, such as processing speed or detection accuracy.
• The compilation or recording of datasets that contain sufficient variation in movements, recording settings and environmental settings remains challenging and should continue to be a topic of discussion.
4. Discussion (5/5)
• The problem of labeling data
  • For increasingly large and complex datasets, manual labeling will become prohibitive.
• A multi-modal approach could improve recognition in some domains, for example in movie analysis. Context such as background, camera motion, interaction between persons and person identity also provides informative cues [25].
• This would be a big step towards fulfilling the longstanding promise of robust automatic recognition and interpretation of human action.
5. References (1/4)
• [1] Thomas B. Moeslund, Adrian Hilton, Volker Kruger, A survey of advances in vision-based human motion capture and analysis, Computer Vision and Image Understanding (CVIU) 104 (2–3) (2006) 90–126.
• [2] Moshe Blank, Lena Gorelick, Eli Shechtman, Michal Irani, Ronen Basri, Actions as space–time shapes, in: Proceedings of the International Conference on Computer Vision (ICCV’05), vol. 2, Beijing, China, October 2005, pp. 1395–1402.
• [3] Lena Gorelick, Moshe Blank, Eli Shechtman, Michal Irani, Ronen Basri, Actions as space–time shapes, IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) 29 (12) (2007) 2247–2253.
• [4] Ivan Laptev, Tony Lindeberg, Space–time interest points, in: Proceedings of the International Conference on Computer Vision (ICCV’03), vol. 1, Nice, France, October 2003, pp. 432–439.
• [5] Chris Harris, Mike Stephens, A combined corner and edge detector, in: Proceedings of the Alvey Vision Conference, Manchester, United Kingdom, August 1988, pp. 147–151.
• [6] Ivan Laptev, Barbara Caputo, Christian Schuldt, Tony Lindeberg, Local velocity-adapted motion events for spatio-temporal recognition, Computer Vision and Image Understanding (CVIU) 108 (3) (2007) 207–229.
• [7] Piotr Dollár, Vincent Rabaud, Garrison Cottrell, Serge Belongie, Behavior recognition via sparse spatio-temporal features, in: Proceedings of the International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance (VS-PETS’05), Beijing, China, October 2005, pp. 65–72.
• [8] Shu-Fai Wong, Roberto Cipolla, Extracting spatiotemporal interest points using global information, in: Proceedings of the International Conference on Computer Vision (ICCV’07), Rio de Janeiro, Brazil, October 2007, pp. 1–8.
5. References (2/4)
• [9] Juan Carlos Niebles, Hongcheng Wang, Li Fei-Fei, Unsupervised learning of human action categories using spatial–temporal words, International Journal of Computer Vision (IJCV) 79 (3) (2008) 299–318.
• [10] Christian Schuldt, Ivan Laptev, Barbara Caputo, Recognizing human actions: a local SVM approach, in: Proceedings of the International Conference on Pattern Recognition (ICPR’04), vol. 3, Cambridge, United Kingdom, 2004, pp. 32–36.
• [11] Paul Scovanner, Saad Ali, Mubarak Shah, A 3-dimensional SIFT descriptor and its application to action recognition, in: Proceedings of the International Conference on Multimedia (MultiMedia’07), Augsburg, Germany, September 2007, pp. 357–360.
• [12] Ju Sun, Xiao Wu, Shuicheng Yan, Loong-Fah Cheong, Tat-Seng Chua, Jintao Li, Hierarchical spatio-temporal context modeling for action recognition, in: Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR’09), Miami, FL, June 2009, pp. 1–8.
• [13] Paul Smith, Niels da Vitoria Lobo, Mubarak Shah, TemporalBoost for event recognition, in: Proceedings of the International Conference on Computer Vision (ICCV’05), vol. 1, Beijing, China, October 2005, pp. 733–740.
• [14] Shiv N. Vitaladevuni, Vili Kellokumpu, Larry S. Davis, Action recognition using ballistic dynamics, in: Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR’08), Anchorage, AK, June 2008, pp. 1–8.
5. References (3/4)
• [15] Xiaolin Feng, Pietro Perona, Human action recognition by sequence of movelet codewords, in: Proceedings of the International Symposium on 3D Data Processing, Visualization, and Transmission (3DPVT’02), Padova, Italy, June 2002, pp. 717–721.
• [16] Daniel Weinland, Edmond Boyer, Remi Ronfard, Action recognition from arbitrary views using 3D exemplars, in: Proceedings of the International Conference on Computer Vision (ICCV’07), Rio de Janeiro, Brazil, October 2007, pp. 1–8.
• [17] Fengjun Lv, Ram Nevatia, Single view human action recognition using key pose matching and Viterbi path searching, in: Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR’07), Minneapolis, MN, June 2007, pp. 1–8.
• [18] Mohiuddin Ahmad, Seong-Whan Lee, Human action recognition using shape and CLG-motion flow from multi-view image sequences, Pattern Recognition 41 (7) (2008) 2237–2252.
• [19] Qinfeng Shi, Li Wang, Li Cheng, Alex Smola, Discriminative human action segmentation and recognition using semi-Markov model, in: Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR’08), Anchorage, AK, June 2008, pp. 1–8.
• [20] Steven M. Seitz, Charles R. Dyer, View-invariant analysis of cyclic motion, International Journal of Computer Vision (IJCV) 25 (3) (1997) 231–251.
5. References (4/4)
• [21] Ross Cutler, Larry S. Davis, Robust real-time periodic motion detection, analysis, and applications, IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) 22 (8) (2000) 781–796.
• [22] Ramprasad Polana, Randal C. Nelson, Detection and recognition of periodic, nonrigid motion, International Journal of Computer Vision (IJCV) 23 (3) (1997) 261–282.
• [23] Ivan Laptev, Marcin Marszałek, Cordelia Schmid, Benjamin Rozenfeld, Learning realistic human actions from movies, in: Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR’08), Anchorage, AK, June 2008, pp. 1–8.
• [24] Mikel D. Rodriguez, Javed Ahmed, Mubarak Shah, Action MACH: a spatiotemporal maximum average correlation height filter for action recognition, in: Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR’08), Anchorage, AK, June 2008, pp. 1–8.
• [25] Marcin Marszałek, Ivan Laptev, Cordelia Schmid, Actions in context, in: Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR’09), Miami, FL, June 2009, pp. 1–8.