
Multimodal Group Action Clustering in Meetings


Presentation Transcript


  1. Multimodal Group Action Clustering in Meetings. Dong Zhang, Daniel Gatica-Perez, Samy Bengio, Iain McCowan, Guillaume Lathoud. IDIAP Research Institute, Switzerland

  2. Outline • Meetings: Sequences of Actions. • Why Clustering? • Layered HMM Framework. • Experiments. • Conclusion and Future Work.

  3. Meetings: Sequences of Actions • Meetings are commonly understood as sequences of events or actions: • meeting agenda: prior sequence of discussion points, presentations, decisions to be made, etc. • meeting minutes: posterior sequence of key phases of meeting, summarised discussions, decisions made, etc. • We aim to investigate the automatic structuring of meetings as sequences of meeting actions. • The actions are multimodal in nature: speech, gestures, expressions, gaze, written text, use of devices, laughter, etc. • In general, these action sequences are due to the group as a whole, rather than a particular individual.

  4. Structuring of Meetings • A meeting is modelled as a continuous sequence of group actions taken from a mutually exclusive and exhaustive set V = { V1, V2, V3, …, VN }. • Three example group-action lexica: • based on turn-taking: Discussion, Monologue, Monologue + Note-taking, Note-taking, Presentation, Presentation + Note-taking, Whiteboard, Whiteboard + Note-taking; • based on tasks: Brainstorming, Decision Making, Information Sharing; • based on interest level: Engaged, Neutral, Disengaged. (Figure: a meeting timeline segmented into a sequence of such actions, e.g. V3, V3, V1, V2, V4, V5, V1, V6.)
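To make this representation concrete, here is a minimal sketch (not from the original presentation) of a meeting stored as an ordered list of labelled, non-overlapping segments drawn from the action set V; the Python names and example timings are hypothetical.

```python
from dataclasses import dataclass

# The mutually exclusive, exhaustive set of group actions V = {V1, ..., VN}
# (here the turn-taking lexicon; any of the three lexica could be used).
GROUP_ACTIONS = [
    "discussion", "monologue", "monologue + note-taking", "note-taking",
    "presentation", "presentation + note-taking", "whiteboard", "whiteboard + note-taking",
]

@dataclass
class Segment:
    start: float   # seconds from the meeting start
    end: float     # seconds from the meeting start
    action: str    # one element of GROUP_ACTIONS

# A meeting is then a continuous, non-overlapping sequence of such segments.
meeting = [
    Segment(0.0, 45.0, "monologue + note-taking"),
    Segment(45.0, 120.0, "discussion"),
    Segment(120.0, 210.0, "presentation + note-taking"),
]
```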

  5. Previous Work • Recognition of Meeting Actions: • supervised, single-layer approaches; • investigated different multi-stream HMM variants, with streams modelling modalities or individuals. • Layered HMM Approach: • a first-layer HMM models Individual Actions (I-HMM), a second-layer HMM models Group Actions (G-HMM); • showed improvement over the single-layer approach. Please refer to: I. McCowan et al., "Modeling human interactions in meetings", ICASSP 2003. I. McCowan et al., "Automatic Analysis of Multimodal Group Actions in Meetings", to appear in IEEE Trans. on PAMI, 2005. D. Zhang et al., "Modeling Individual and Group Actions in Meetings: a Two-Layer HMM Framework", IEEE Workshop on Event Mining, CVPR, 2004.

  6. Why Clustering? • Unsupervised action clustering instead of supervised action recognition. • High-level semantic group actions are difficult to: • define: what action lexica are appropriate? • annotate: in general, temporal boundaries are not precise. • Clustering allows us to find the natural structure of a meeting, and may help us better understand the data. (The slide also shows the three group-action lexica from slide 4.)

  7. Outline • Meetings: Sequences of Actions. • Why Clustering? • Layered HMM Framework. • Experiments. • Conclusion and Future Work.

  8. Single-layer HMM Framework • Single-layer HMM: the audio-visual features of each participant and the group-level features are concatenated into one large vector that defines the observation space. Please refer to: I. McCowan et al., "Modeling human interactions in meetings", ICASSP 2003. I. McCowan et al., "Automatic Analysis of Multimodal Group Actions in Meetings", to appear in IEEE Trans. on PAMI, 2005.
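As a rough sketch of this observation space (assuming frame-synchronous features; hmmlearn is used purely for illustration and is not the toolkit of the original work), per-participant and group-level features can be stacked into one vector per frame and modelled with a single HMM:

```python
import numpy as np
from hmmlearn import hmm  # illustrative choice of HMM toolkit, not the original one

T, n_people = 1000, 4
person_feats = np.random.randn(T, n_people, 12)  # placeholder: 12 AV features per person per frame
group_feats = np.random.randn(T, 4)              # placeholder: e.g. whiteboard/screen activity

# Single-layer HMM: one large observation vector per frame (4 * 12 + 4 = 52 dimensions here).
obs = np.concatenate([person_feats.reshape(T, -1), group_feats], axis=1)

model = hmm.GaussianHMM(n_components=8, covariance_type="diag", n_iter=20)
model.fit(obs)               # states play the role of group actions
states = model.predict(obs)  # frame-level decoding
```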

  9. Two-layer HMM Framework • Two-layer HMM: by defining a proper set of individual actions, we decompose the group action recognition problem into two layers, from individual to group. Both layers use ergodic HMMs or extensions. Please refer to: D. Zhang et al., "Modeling Individual and Group Actions in Meetings: a Two-Layer HMM Framework", IEEE Workshop on Event Mining, CVPR, 2004.
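A schematic sketch of the two-layer decomposition, assuming the I-HMM outputs are computed over short windows of each participant's features and then stacked as G-HMM observations (the window length, feature dimensions and use of hmmlearn are all assumptions for illustration):

```python
import numpy as np
from hmmlearn import hmm

I_ACTIONS = ["speaking", "writing", "idle"]  # the individual-action set from slide 18

# Layer 1: one person-independent I-HMM per individual action,
# trained on labelled per-person feature segments pooled over all participants.
i_hmms = {a: hmm.GaussianHMM(n_components=3, covariance_type="diag") for a in I_ACTIONS}
# ... i_hmms[a].fit(segments_labelled_as_a) for each action a ...

def i_layer_outputs(person_feats, win=20):
    """For one participant, slide a window over the features and emit the
    log-likelihood of each individual-action model for every window."""
    out = []
    for t in range(0, len(person_feats) - win + 1, win):
        window = person_feats[t:t + win]
        out.append([i_hmms[a].score(window) for a in I_ACTIONS])
    return np.asarray(out)  # shape: (n_windows, len(I_ACTIONS))

# Layer 2: concatenate the I-HMM outputs of all participants (plus group-level
# features) over time and use them as the observation sequence of the G-HMM.
```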

  10. Advantages • Compared with the single-layer HMM, smaller observation spaces. • The individual-layer HMM (I-HMM) is person-independent, so well-estimated models can be trained on much more data. • The group-layer HMM (G-HMM) is less sensitive to variations in the low-level audio-visual features. • Makes it easy to explore different combination schemes. (Diagram: audio-visual features feed the supervised I-HMM layer, whose outputs feed the unsupervised G-HMM layer.)

  11. Models for I-HMM • Early Integration (Early Int.): a standard HMM is trained on the combined audio-visual features. • Multi-stream HMM (MS-HMM): combines audio-only and visual-only streams; each stream is trained independently, and the final classification fuses the outputs of both modalities by estimating their joint occurrence. Please refer to: S. Dupont et al., "Audio-visual speech modeling for continuous speech recognition", IEEE Transactions on Multimedia, pp. 141–151, Sep. 2000. • Asynchronous HMM (A-HMM): models the audio and visual streams jointly while allowing them to be asynchronous. Please refer to: S. Bengio, "An asynchronous hidden Markov model for audio-visual speech recognition", NIPS 2003.
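For the multi-stream case, a minimal sketch of one common formulation is a weighted combination of the per-stream log-likelihoods at classification time; the weighting scheme and fusion point below are illustrative assumptions, not details taken from the slide:

```python
def multistream_score(audio_hmm, visual_hmm, audio_obs, visual_obs, w_audio=0.5):
    """Score one segment under a multi-stream model: each modality-specific HMM
    is evaluated on its own stream and the log-likelihoods are combined with a
    stream weight (w_audio + w_visual = 1). Early integration would instead
    train a single HMM on the concatenated audio-visual feature vector."""
    ll_audio = audio_hmm.score(audio_obs)     # log p(audio obs | action model)
    ll_visual = visual_hmm.score(visual_obs)  # log p(visual obs | action model)
    return w_audio * ll_audio + (1.0 - w_audio) * ll_visual

# Classification picks the action whose stream-weighted score is highest, e.g.:
# best = max(I_ACTIONS, key=lambda a: multistream_score(a_hmms[a], v_hmms[a], A, V))
```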

  12. Models for G-HMM: Clustering • Assume unknown segmentation and an unknown number of clusters. (Figure: likelihood as a function of the candidate segmentations and the number of clusters, 1–6.) Please refer to: J. Ajmera et al., "A robust speaker clustering algorithm", IEEE ASRU Workshop 2003.
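The slide points to the robust speaker clustering algorithm of Ajmera et al. for this step. The sketch below only conveys the general shape of likelihood-driven agglomerative clustering with an unknown number of clusters (start from many clusters and merge while merging does not hurt the likelihood); it omits re-segmentation and is not the exact published algorithm, whose details should be taken from the cited paper:

```python
def split_evenly(items, k):
    """Uniform initialisation: deal the segments into k clusters round-robin."""
    return {c: items[c::k] for c in range(k)}

def agglomerative_cluster(segments, init_k, train, loglik, merge):
    """Rough sketch, in the spirit of the cited Ajmera et al. (ASRU 2003) work,
    NOT the exact published algorithm. `train`, `loglik` and `merge` are
    user-supplied: fit a model on a cluster's data, score data under a model,
    and build a merged model for two clusters."""
    clusters = split_evenly(segments, init_k)
    models = {c: train(data) for c, data in clusters.items()}
    while True:
        best = None
        for a in clusters:
            for b in clusters:
                if a >= b:
                    continue
                merged = merge(models[a], models[b])
                gain = (loglik(merged, clusters[a] + clusters[b])
                        - loglik(models[a], clusters[a])
                        - loglik(models[b], clusters[b]))
                if gain >= 0 and (best is None or gain > best[0]):
                    best = (gain, a, b, merged)
        if best is None:          # no merge increases the data likelihood: stop
            return clusters, models
        _, a, b, merged = best
        clusters[a] = clusters[a] + clusters.pop(b)
        models[a] = merged
        del models[b]
```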

  13. Linking Two Layers (1)

  14. Linking Two Layers (2) • Normalization: the I-HMM outputs are normalised before being used as input features to the G-HMM (equation shown on the slide). Please refer to: D. Zhang et al., "Modeling Individual and Group Actions in Meetings: a Two-Layer HMM Framework", IEEE Workshop on Event Mining, CVPR, 2004.

  15. Outline • Meetings: Sequences of Actions. • Why Clustering? • Layered HMM Framework. • Experiments. • Conclusion and Future Work.

  16. Data Collection • Scripted meeting corpus: • 30 meetings for training, 29 for testing; • each meeting lasts about 5 minutes; • 4 participants per meeting; • 3 cameras, 12 microphones; • each meeting was 'scripted' as a sequence of actions. • http://mmm.idiap.ch/ (Photo: the IDIAP meeting room.) Please refer to: I. McCowan et al., "Modeling human interactions in meetings", ICASSP 2003.

  17. Audio-Visual Feature Extraction • Person-specific audio-visual features: • audio: seat-region audio activity, speech pitch, speech energy, speech rate; • visual: head vertical centroid, head eccentricity, right-hand centroid, right-hand angle, right-hand eccentricity, head and hand motion. • Group-level audio-visual features: • audio: audio activity from the whiteboard region, audio activity from the screen region; • visual: mean difference from the whiteboard, mean difference from the projector screen. (Figure: views from cameras 1, 2 and 3.)

  18. Action Lexicon • Group Actions = Individual Actions + Group Devices (group actions can be treated as combinations of individual actions plus the states of group devices). • Group actions: Discussion, Monologue, Monologue + Note-taking, Note-taking, Presentation, Presentation + Note-taking, Whiteboard, Whiteboard + Note-taking. • Individual actions: Speaking, Writing, Idle. • Group devices: Projector screen, Whiteboard.
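A hypothetical illustration of the "Group Actions = Individual Actions + Group Devices" idea; the presentation does not give the exact combination rules, so the mapping below is only a plausible example:

```python
def combine(individual_actions, projector_on, whiteboard_on):
    """Illustrative (hypothetical) mapping from individual actions plus
    group-device states to a group action; the precise rules are not given
    in the presentation."""
    n_speaking = sum(a == "speaking" for a in individual_actions)
    note_taking = any(a == "writing" for a in individual_actions)
    if projector_on:
        base = "presentation"
    elif whiteboard_on:
        base = "whiteboard"
    elif n_speaking >= 2:
        base = "discussion"
    elif n_speaking == 1:
        base = "monologue"
    else:
        base = "note-taking"
    if note_taking and base not in ("discussion", "note-taking"):
        return base + " + note-taking"
    return base

# e.g. combine(["speaking", "writing", "idle", "writing"], False, False)
# -> "monologue + note-taking"
```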

  19. Example (Figure: a timeline labelling each of the four participants over time with the individual actions Speaking (S), Writing (W) and Idle, together with projector and whiteboard usage; the resulting group actions are Monologue1 + Note-taking, Discussion, Presentation + Note-taking and Whiteboard + Note-taking.)

  20. Performance Measures • We use the 'purity' concept to evaluate results. Please refer to: J. Ajmera et al., "A robust speaker clustering algorithm", IEEE ASRU Workshop 2003. • Average action purity (aap): how well is one action limited to only one cluster? • Average cluster purity (acp): how well is one cluster limited to only one action? • 'aap' and 'acp' are combined into a single measure 'K'.
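A small sketch of these measures, following the purity definitions commonly used in the speaker-clustering literature that the slide cites (the exact formulas in the cited paper are authoritative; the matrix layout and example counts below are assumptions):

```python
import numpy as np

def purity_scores(n):
    """n[i, j] = number of frames with true action i assigned to cluster j.
    Returns (aap, acp, K): average action purity, average cluster purity,
    and the combined measure K = sqrt(aap * acp)."""
    n = np.asarray(n, dtype=float)
    N = n.sum()
    acp = (n**2 / n.sum(axis=0, keepdims=True)).sum() / N  # purity of each cluster
    aap = (n**2 / n.sum(axis=1, keepdims=True)).sum() / N  # purity of each action
    return aap, acp, np.sqrt(aap * acp)

# Example: 3 true actions vs. 3 clusters (frame counts).
counts = [[90,  5,  5],
          [10, 80, 10],
          [ 0, 10, 90]]
aap, acp, K = purity_scores(counts)
```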

  21. Results (Tables: clustering individual meetings, where the true number of clusters is 3.93 on average per meeting, and clustering meeting collections, where the true number of clusters is 8.)

  22. Results (Figures: results for clustering individual meetings and for clustering the entire meeting collection.)

  23. Conclusions • Structuring of meetings as a sequence of group actions. • We proposed a layered HMM framework for group action clustering: • supervised individual layer and unsupervised group layer. • Experiments showed: • advantage of using both audio and visual modalities. • better performance using layered HMM. • clustering gives meaningful segmentation into group actions. • clustering yields consistent labels when done across multiple meetings.

  24. Future Work • Clustering: • investigating different sets of Individual Actions; • handling variable numbers of participants across or within meetings. • Related: • joint training of the layers in the supervised two-layer HMM; • defining new sets of group actions, e.g. based on interest level. • Data collection: • within the scope of the AMI project (www.amiproject.org), we are currently collecting a 100-hour corpus of natural meetings to facilitate further research.

  25. Linking Two Layers (1) • Hard decision: the individual action model with the highest probability outputs a value of 1, while all other models output 0, e.g. (1, 0, 0). • Soft decision: the probability of each individual action model is output and used as input features to the G-HMM, e.g. (0.7, 0.1, 0.2). (Diagram: audio-visual features feed the I-HMM models, whose decisions feed the G-HMM.)
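A minimal sketch of how the hard and soft decision vectors could be formed from the per-action I-HMM scores, assuming the soft decision is a normalised probability vector over the individual-action models (matching the (0.7, 0.1, 0.2) vs. (1, 0, 0) example on the slide); the normalisation used in the actual system is described in the cited paper:

```python
import numpy as np

def decisions(log_likelihoods):
    """log_likelihoods: one I-HMM score per individual-action model for a window.
    Returns (soft, hard) decision vectors that can be fed to the G-HMM."""
    ll = np.asarray(log_likelihoods, dtype=float)
    # Soft decision: normalise the model scores into a probability vector
    # (a numerically stable softmax over the log-likelihoods).
    probs = np.exp(ll - ll.max())
    soft = probs / probs.sum()
    # Hard decision: 1 for the best-scoring model, 0 for all others.
    hard = np.zeros_like(soft)
    hard[np.argmax(soft)] = 1.0
    return soft, hard

# e.g. decisions([-100.0, -101.9, -101.3])
# -> soft roughly (0.70, 0.11, 0.19), hard (1, 0, 0)
```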

  26. Results • Two clustering cases: • clustering individual meetings; • clustering the entire meeting collection. • The baseline system: single-layer HMM. (Figures: results for clustering individual meetings and for clustering the entire meeting collection.)
