Collection of multimodal data Face – Speech – Body

George Caridakis (ICCS), Ginevra Castellano (DIST), Loic Kessous (TAU)

Overview • Objectives • Scenario • Equipment specifications • Subjects & Procedure • Visual aspects • Acoustic aspects • Future processing • Please try this at home…

Objectives • Collection of emotional multimodal data • Process different modalities • Holy Grail: “EMOTION RECOGNITION”

Scenario • Inspired by the GEMEP corpus • Pseudo-language sentence (“Toko, damato ma gali sa”) • Standing body posture • 10 subjects • 8 emotions uniformly distributed across the quadrants (2D emotion theory, valence-arousal) • 3 repetitions of an emotion-specific gesture • 3 repetitions of an emotion-independent gesture
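
The scenario can be enumerated as a list of takes, which helps when labelling tapes and files. The following is a minimal sketch, not the authors' protocol: it assumes a full factorial design (every subject performs every emotion with both gesture types), which the slides do not state explicitly. The emotion labels are the eight listed on the "Subjects & Procedure" slide; the function and constant names are hypothetical.

```python
from itertools import product

# Emotion labels as listed on the "Subjects & Procedure" slide.
EMOTIONS = [
    "despair", "hot anger", "irritation", "sadness",
    "interest", "pleasure", "joy", "pride",
]
GESTURE_TYPES = ["emotion-specific", "emotion-independent"]
N_SUBJECTS = 10
N_REPETITIONS = 3


def build_session_plan():
    """Return one (subject, emotion, gesture_type, repetition) tuple per take.

    Assumption: a full factorial design, i.e. every subject records every
    emotion with both gesture types, three times each.
    """
    subjects = range(1, N_SUBJECTS + 1)
    repetitions = range(1, N_REPETITIONS + 1)
    return list(product(subjects, EMOTIONS, GESTURE_TYPES, repetitions))


plan = build_session_plan()
print(len(plan))  # 10 subjects * 8 emotions * 2 gesture types * 3 repetitions = 480
```

Under that assumption the corpus would contain 480 takes, 48 per subject.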

Emotion-specific gestures

Equipment specifications • 2 DV cameras • Full body • Face • Wireless microphone (shirt-mounted) • PC + external sound card • Uniform dark background • 2 artificial light sources • Light-coloured, long-sleeved shirt ;)

Subjects & Procedure • Subjects • 10 “actors” • 6 males • 4 females • Emotions: despair, hot anger, irritation, sadness, interest, pleasure, joy, pride • Procedure • Subject instructions • Clap before every execution to synchronize the streams
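
The clap gives every stream a sharp, shared acoustic event. One possible way to use it, sketched below, is to find the clap onset in each audio track with a short-time energy threshold and take the difference as the offset between streams. This is an illustrative assumption, not the method used by the authors; `clap_onset`, `stream_offset`, the 10 ms window and the 0.5 threshold ratio are all hypothetical choices.

```python
def clap_onset(samples, sample_rate, window_ms=10, threshold_ratio=0.5):
    """Return the time in seconds of the first analysis window whose mean
    energy reaches threshold_ratio times the loudest window's energy."""
    window = max(1, int(sample_rate * window_ms / 1000))
    energies = []
    for start in range(0, len(samples) - window + 1, window):
        frame = samples[start:start + window]
        energies.append(sum(s * s for s in frame) / window)
    if not energies:
        raise ValueError("signal is shorter than one analysis window")
    peak = max(energies)
    if peak == 0:
        raise ValueError("signal is silent; no clap to detect")
    for index, energy in enumerate(energies):
        if energy >= threshold_ratio * peak:
            return index * window / sample_rate


def stream_offset(reference, other, rate_reference, rate_other):
    """Seconds by which the clap in `other` lags the clap in `reference`."""
    return clap_onset(other, rate_other) - clap_onset(reference, rate_reference)


# Synthetic demo: the same 20 ms "clap" placed at 0.20 s and 0.35 s.
RATE = 1000  # Hz, kept low so the example stays small
microphone = [0.0] * 200 + [1.0] * 20 + [0.0] * 100
camera_audio = [0.0] * 350 + [1.0] * 20 + [0.0] * 100
offset = stream_offset(microphone, camera_audio, RATE, RATE)
print(round(offset, 3))  # 0.15
```

The resolution of this sketch is one analysis window (10 ms here), which is finer than one PAL frame (40 ms). Real recordings would likely need a more robust onset detector if there is speech or noise before the clap.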

Video quality issues • Highest possible resolution • Progressive video (not interlaced) • Correct exposure • Good color quality • No compression artifacts • Uniform lighting

Interlacing / Over-exposure • Interlacing / De-interlacing • Over-exposure • 70% zebra pattern • Prefer lower exposure so the signal is not clipped
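
The zebra pattern is a camera viewfinder aid, but a similar check can be run offline on captured frames. The sketch below is an assumption-laden analogue, not the camera's algorithm: it treats pixels as 8-bit full-range RGB (a simplification, since DV video normally uses a narrower luma range), computes luma with the ITU-R BT.601 weights, and reports the fraction of pixels at or above the 70% level and the fraction with a clipped channel. The function names are hypothetical.

```python
def luma(r, g, b):
    """Approximate luma from 8-bit RGB using the ITU-R BT.601 weights."""
    return 0.299 * r + 0.587 * g + 0.114 * b


def exposure_report(pixels, zebra_level=0.70, clip_value=255):
    """Summarise exposure for an iterable of (r, g, b) tuples.

    zebra_fraction:   share of pixels whose luma is at or above zebra_level
                      of full scale (the area a 70% zebra would mark).
    clipped_fraction: share of pixels with at least one saturated channel.
    """
    total = zebra = clipped = 0
    for r, g, b in pixels:
        total += 1
        if luma(r, g, b) >= zebra_level * clip_value:
            zebra += 1
        if max(r, g, b) >= clip_value:
            clipped += 1
    if total == 0:
        raise ValueError("no pixels supplied")
    return {
        "zebra_fraction": zebra / total,
        "clipped_fraction": clipped / total,
    }


# Tiny demo frame: one clipped pixel, one bright pixel, two darker ones.
frame = [(255, 255, 255), (200, 200, 200), (100, 100, 100), (0, 0, 0)]
report = exposure_report(frame)
print(report)  # {'zebra_fraction': 0.5, 'clipped_fraction': 0.25}
```

A non-zero `clipped_fraction` on the subject's face or hands is a sign that the exposure should be lowered, in line with the advice above.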

Colour/Lighting • Medium Y/C Resolution • Compression Artifacts • Exposure • Good Video quality • Source: DV

Archiving • PAL: 720x576 @ 25 frames/second • DV format: ~36 Mbit/s • ~16 GB/hour • MPEG-2 @ 4-8 Mbit/s (DVD quality) • ~1.8-3.5 GB/hour • MPEG-1 @ 1.1 Mbit/s • ~500 MB/hour
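
The storage figures follow directly from the bitrates: bits per second times 3600 seconds, divided by 8 bits per byte. A small sketch of that arithmetic, assuming decimal units (1 GB = 1000 MB); the function name is hypothetical:

```python
def gigabytes_per_hour(mbit_per_second):
    """Convert a bitrate in Mbit/s to storage in GB per hour (decimal units)."""
    megabytes_per_hour = mbit_per_second * 3600 / 8
    return megabytes_per_hour / 1000


bitrates = {
    "DV": 36.0,
    "MPEG-2 (low)": 4.0,
    "MPEG-2 (high)": 8.0,
    "MPEG-1": 1.1,
}
storage = {name: round(gigabytes_per_hour(rate), 3) for name, rate in bitrates.items()}
for name, gigabytes in storage.items():
    print(f"{name}: {gigabytes} GB/hour")
```

With these assumptions DV comes to about 16.2 GB/hour and MPEG-1 to about 0.5 GB/hour, matching the slide. The upper MPEG-2 figure comes out at 3.6 GB/hour, slightly above the slide's rounded 3.5.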

Visual Aspects Summary • Video camera • DV or better • Progressive scan capability • Over-exposure indication (zebra patterns) • Shooting • Use the zebra patterns at 70% • Zoom in as much as possible to maximize the subject’s resolution • Facial features must be visible for facial analysis • Try to avoid occlusions (hair, glasses, clothes, hand movement) • Uniform lighting conditions • Archive DV tapes, DV video or frames (not MPEG-1)

Acoustic aspects • Why “Toko, damato ma gali sa”? • Toko: solicitation by naming the interlocutor • Vowels found in the majority of languages • Meaning: “Toko, can you open it?” (a request), to maintain a semantic aspect • Sampling frequency: 44.1 kHz • 16-bit mono • Uncompressed .wav files
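
The audio format above (44.1 kHz, 16-bit, mono, uncompressed WAV) can be written and verified with Python's standard `wave` module. The sketch below is only an illustration of those parameters, not the recording software used for the corpus; the helper names and the file name `toko_demo.wav` are hypothetical, and the test tone stands in for real speech.

```python
import math
import os
import struct
import tempfile
import wave

SAMPLE_RATE = 44100        # Hz
SAMPLE_WIDTH_BYTES = 2     # 16-bit samples
CHANNELS = 1               # mono


def write_mono_wav(path, samples):
    """Write floats in [-1.0, 1.0] as 16-bit mono PCM at 44.1 kHz."""
    frames = b"".join(
        struct.pack("<h", int(max(-1.0, min(1.0, s)) * 32767)) for s in samples
    )
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(CHANNELS)
        wav_file.setsampwidth(SAMPLE_WIDTH_BYTES)
        wav_file.setframerate(SAMPLE_RATE)
        wav_file.writeframes(frames)


def matches_recording_spec(path):
    """Check that a WAV file is 44.1 kHz, 16-bit, mono and uncompressed."""
    with wave.open(path, "rb") as wav_file:
        return (
            wav_file.getframerate() == SAMPLE_RATE
            and wav_file.getsampwidth() == SAMPLE_WIDTH_BYTES
            and wav_file.getnchannels() == CHANNELS
            and wav_file.getcomptype() == "NONE"
        )


# Demo: 0.1 s of a 440 Hz tone written to a temporary file (hypothetical name).
demo_path = os.path.join(tempfile.gettempdir(), "toko_demo.wav")
tone = [0.5 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE)
        for n in range(SAMPLE_RATE // 10)]
write_mono_wav(demo_path, tone)
print(matches_recording_spec(demo_path))  # True
```

Running `matches_recording_spec` over every recorded file is a cheap way to catch takes that were saved with the wrong sound-card settings.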

Future processing • Process different modalities • Facial feature extraction • Gesture expressiveness analysis • Acoustic analysis • Gesture recognition • Synchronization • Modalities fusion • RNN • RSOM + Markov • SVM • … • Emotion recognition
