
Multimodal Deep Learning




Presentation Transcript


1. Multimodal Deep Learning. Jiquan Ngiam, Aditya Khosla, Mingyu Kim, Juhan Nam, Honglak Lee & Andrew Ng, Stanford University.

2. McGurk Effect

3. Audio-Visual Speech Recognition

4. Feature Challenge. [Diagram: input features feed a classifier, e.g. an SVM.]

5. Representing Lips. • Can we learn better representations for audio/visual speech recognition? • How can multimodal data (multiple sources of input) be used to find better features?

6-7. Unsupervised Feature Learning. [Diagram: raw input vectors (5, 1.1, ..., 10) mapped to learned feature vectors (9, 1.67, ..., 3).]

8. Multimodal Features. [Diagram: audio and video input vectors combined into joint multimodal features.]

9. Cross-Modality Feature Learning. [Diagram: features learned from a single modality's input vector (5, 1.1, ..., 10).]

10. Feature Learning Models

11. Feature Learning with Autoencoders. [Diagram: one autoencoder per modality; audio input reconstructed as audio, video input reconstructed as video.]
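To make the autoencoder idea concrete, here is a minimal sketch (not the authors' code) of a single-modality autoencoder in PyTorch. Layer sizes and the stand-in data are illustrative assumptions.

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, n_input: int, n_hidden: int):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_input, n_hidden), nn.Sigmoid())
        self.decoder = nn.Linear(n_hidden, n_input)

    def forward(self, x):
        h = self.encoder(x)           # learned features
        return self.decoder(h), h     # reconstruction + features

# Illustrative training step on random stand-in data.
model = Autoencoder(n_input=100, n_hidden=32)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.randn(64, 100)              # a batch of stand-in audio frames
recon, features = model(x)
loss = nn.functional.mse_loss(recon, x)
opt.zero_grad()
loss.backward()
opt.step()
```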

12-13. Bimodal Autoencoder. [Diagram: audio and video inputs feed a single hidden representation that reconstructs both modalities.]
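A sketch of this shallow bimodal variant, assuming the two modalities are simply concatenated at the input and each is reconstructed by its own output layer. Sizes are illustrative, not from the paper.

```python
import torch
import torch.nn as nn

class BimodalAutoencoder(nn.Module):
    def __init__(self, n_audio: int, n_video: int, n_hidden: int):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_audio + n_video, n_hidden), nn.Sigmoid())
        self.decode_audio = nn.Linear(n_hidden, n_audio)
        self.decode_video = nn.Linear(n_hidden, n_video)

    def forward(self, audio, video):
        h = self.encoder(torch.cat([audio, video], dim=1))
        return self.decode_audio(h), self.decode_video(h), h

model = BimodalAutoencoder(n_audio=100, n_video=300, n_hidden=128)
audio = torch.randn(64, 100)   # stand-in spectrogram frames
video = torch.randn(64, 300)   # stand-in lip-region pixels
audio_rec, video_rec, h = model(audio, video)
loss = (nn.functional.mse_loss(audio_rec, audio)
        + nn.functional.mse_loss(video_rec, video))
```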

14. Shallow Learning. • Mostly unimodal features learned. [Diagram: hidden units connected to audio and video inputs.]


16. Bimodal Autoencoder. [Diagram: video input only; the hidden representation still reconstructs both audio and video.] Cross-modality learning: learn better video features by using audio as a cue.

17. Cross-modality Deep Autoencoder. [Diagram: video input passes through several layers to a learned representation, which reconstructs both audio and video.]

18. Cross-modality Deep Autoencoder. [Diagram: the same architecture with audio input, reconstructing both modalities.]

19. Bimodal Deep Autoencoders. [Diagram: audio and video inputs pass through modality-specific layers ("phonemes" for audio, "visemes"/mouth shapes for video) into a shared representation, which reconstructs both modalities.]

20. Bimodal Deep Autoencoders. [Diagram: video-only input through the "visemes" (mouth shapes) pathway, reconstructing both modalities.]

21. Bimodal Deep Autoencoders. [Diagram: audio-only input through the "phonemes" pathway, reconstructing both modalities.]

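A minimal sketch of the architecture in slides 19-21: modality-specific encoder layers, a shared layer joining them, and separate decoders per modality. Layer sizes, activations, and the single-layer depth per stage are assumptions for illustration.

```python
import torch
import torch.nn as nn

class BimodalDeepAutoencoder(nn.Module):
    def __init__(self, n_audio=100, n_video=300, n_mod=128, n_shared=64):
        super().__init__()
        # Modality-specific encoders ("phoneme"-like / "viseme"-like layers).
        self.enc_audio = nn.Sequential(nn.Linear(n_audio, n_mod), nn.Sigmoid())
        self.enc_video = nn.Sequential(nn.Linear(n_video, n_mod), nn.Sigmoid())
        # Shared representation on top of both modalities.
        self.shared = nn.Sequential(nn.Linear(2 * n_mod, n_shared), nn.Sigmoid())
        # Separate decoders back to each modality.
        self.dec_audio = nn.Sequential(nn.Linear(n_shared, n_mod), nn.Sigmoid(),
                                       nn.Linear(n_mod, n_audio))
        self.dec_video = nn.Sequential(nn.Linear(n_shared, n_mod), nn.Sigmoid(),
                                       nn.Linear(n_mod, n_video))

    def forward(self, audio, video):
        h = self.shared(torch.cat([self.enc_audio(audio),
                                   self.enc_video(video)], dim=1))
        return self.dec_audio(h), self.dec_video(h), h
```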

23. Training the Bimodal Deep Autoencoder. [Diagram: three training configurations of the same network: both inputs, video-only input, and audio-only input, each reconstructing both audio and video.] • Train a single model to perform all 3 tasks. • Similar in spirit to denoising autoencoders (Vincent et al., 2008).
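The three-task training scheme can be sketched as follows, reusing the BimodalDeepAutoencoder class from the sketch above. Zeroing out the missing modality is our stand-in for how an absent input is clamped.

```python
import torch
import torch.nn.functional as F

def three_task_loss(model, audio, video):
    # Same weights trained under three input configurations; the targets are
    # always the full audio and video, like denoising with a whole modality
    # "noised out".
    total = 0.0
    for a_in, v_in in [(audio, video),                      # both modalities
                       (torch.zeros_like(audio), video),    # video only
                       (audio, torch.zeros_like(video))]:   # audio only
        a_rec, v_rec, _ = model(a_in, v_in)
        total = total + F.mse_loss(a_rec, audio) + F.mse_loss(v_rec, video)
    return total

model = BimodalDeepAutoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
audio = torch.randn(64, 100)   # stand-in batch
video = torch.randn(64, 300)
loss = three_task_loss(model, audio, video)
opt.zero_grad()
loss.backward()
opt.step()
```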

24. Evaluations

25. Visualizations of Learned Features. [Figure: audio (spectrogram) and video features learned over 100 ms windows, shown at 0 ms, 33 ms, 67 ms, and 100 ms.]

26. Lip-reading with AVLetters. [Diagram: cross-modality deep autoencoder with video input.] • AVLetters: 26-way letter classification, 10 speakers, 60x80-pixel lip regions. • Cross-modality learning.
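The evaluation pipeline might look like the following sketch: encode lip-region video with the trained video encoder, then fit a linear classifier (the slides' pipeline uses an SVM) for 26-way letter classification. The encoder here is an untrained stand-in for the one learned above, the data is random, and AVLetters loading and feature windowing are not shown.

```python
import numpy as np
import torch
from sklearn.svm import LinearSVC

# Stand-in for the trained video encoder (enc_video in the sketches above).
enc_video = torch.nn.Sequential(torch.nn.Linear(300, 128), torch.nn.Sigmoid())

n_clips = 200
video = torch.randn(n_clips, 300)                  # stand-in lip-region clips
labels = np.random.randint(0, 26, size=n_clips)    # 26 letters

with torch.no_grad():
    feats = enc_video(video).numpy()               # learned video features

# Linear classifier on the learned features.
clf = LinearSVC().fit(feats[:150], labels[:150])
print("held-out accuracy:", clf.score(feats[150:], labels[150:]))
```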

27-29. Lip-reading with AVLetters. [Results charts.]

30. Lip-reading with CUAVE. [Diagram: cross-modality deep autoencoder with video input.] • CUAVE: 10-way digit classification, 36 speakers. • Cross-modality learning.

31-33. Lip-reading with CUAVE. [Results charts.]

34. Multimodal Recognition. [Diagram: bimodal deep autoencoder with a shared representation over audio and video inputs.] • CUAVE: 10-way digit classification, 36 speakers. • Evaluate in clean and noisy audio scenarios. • In the clean audio scenario, audio alone performs extremely well.

35-37. Multimodal Recognition. [Results charts.]

38-39. Shared Representation Evaluation. [Diagram: supervised training uses audio features through the shared representation; testing uses video features through the same shared representation, with a linear classifier on top.] Method: learned features + Canonical Correlation Analysis.
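A sketch of a CCA-based version of this protocol using scikit-learn: fit CCA between audio and video features, train a linear classifier on the projected audio features, and test it on the projected video features. The stand-in data, projection dimension, and choice of classifier are illustrative assumptions, not the paper's exact setup.

```python
import numpy as np
from sklearn.cross_decomposition import CCA
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, d_a, d_v = 300, 64, 64
audio_feats = rng.standard_normal((n, d_a))
# Stand-in video features, loosely correlated with the audio features.
video_feats = (audio_feats @ rng.standard_normal((d_a, d_v))
               + 0.1 * rng.standard_normal((n, d_v)))
labels = rng.integers(0, 10, size=n)

# Project both modalities into a common correlated subspace.
cca = CCA(n_components=20).fit(audio_feats, video_feats)
a_proj, v_proj = cca.transform(audio_feats, video_feats)

clf = LogisticRegression(max_iter=1000).fit(a_proj, labels)  # train on audio
print("test-on-video accuracy:", clf.score(v_proj, labels))  # test on video
```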

40-41. McGurk Effect. A visual /ga/ combined with an audio /ba/ is often perceived as /da/.

42. Conclusion. [Diagrams: the cross-modality deep autoencoder and the bimodal deep autoencoder.] • Applied deep autoencoders to discover features in multimodal data. • Cross-modality learning: we obtained better video features (for lip-reading) by using audio as a cue. • Multimodal feature learning: learned representations that relate audio and video data.


45. Bimodal Learning with RBMs. [Diagram: a single layer of hidden units connected to both audio and video inputs.]
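A rough sketch of the bimodal RBM idea, using scikit-learn's BernoulliRBM as a stand-in (the original work uses RBM variants suited to real-valued inputs): a single RBM whose visible layer is the concatenation of both modalities, so its hidden units model audio and video jointly.

```python
import numpy as np
from sklearn.neural_network import BernoulliRBM

rng = np.random.default_rng(0)
audio = rng.random((500, 100))        # stand-in audio frames scaled to [0, 1]
video = rng.random((500, 300))        # stand-in lip-region pixels in [0, 1]

# One RBM over the concatenated modalities; hidden units act as shared features.
rbm = BernoulliRBM(n_components=128, learning_rate=0.05, n_iter=10)
joint = np.concatenate([audio, video], axis=1)
rbm.fit(joint)
hidden = rbm.transform(joint)         # (500, 128) bimodal hidden activations
```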
