METISS

METISS: Audio & speech processing
INRIA-Rennes
Modélisation et Expérimentation pour le Traitement des Informations et des Signaux Sonores (Modelling and Experimentation for the Processing of Information and Sound Signals)
Scientific leader: Frédéric BIMBOT
Overview of activities 2002-2005
Evaluation INRIA

Introduction

Framework and foundations
• General framework: audio scene analysis, description and recognition, that is
  - analysis, processing
  - modelling, representation, description, decomposition
  - detection, classification
  - recognition
  of audio, speech, music, multimedia, … signals, recordings, streams, tracks, …
• Scientific foundations
  - Probabilistic models and statistical estimation
  - Redundant systems and adaptive representations

Scientific objectives
• to design generic, robust, fast and flexible approaches to a variety of problems in speech and audio segmentation, detection and classification, operating in the probabilistic framework
• to investigate the theoretical properties and practical applications of adaptive representations and sparseness criteria, with the purpose of advanced processing and structured description of audio signals
• to extend and adapt approaches classically used in speech processing to other classes of signals and problems
• to study the convergence between statistical approaches and adaptive decomposition within a common framework embedding signal representations and classification

Application domain and focus
• Applicative fields
  - Security, verification, authentication, rights management
  - Rich audio transcription, content-based indexing, multi-purpose navigation, information retrieval and summarization
  - Advanced audio processing: segmentation, separation, spatialisation, sound object extraction, music modeling
  - Audio and audio-visual authoring, production and repurposing
  - Education and entertainment
• Primary focuses
  - Speaker characterisation
  - Audio structuring and indexing
  - Sparse representations: theory and applications
  - Audio source separation (under-determined case)

Team composition 2002-2005
[Timeline chart over 2002, 2003, 2004 and 2005; the per-year layout is not recoverable from the transcript]
• Names on the chart: BIMBOT, GRAVIER, GRIBONVAL, POREE, BETSER, KIJAK, KRSTULOVIC, GONON, BEN, MORARU, BLOUET, BENAROYA, MC DONAGH, BEN, COLLET, LESAGE, OZEROV, SALL, FORTHOFER, HUET, TENG, ARBERET, MAILHE
• Permanent researchers (CR - CNRS or INRIA): 3
• Non-permanent staff (Engineers, ATER, Post-Doc): 2
• PhD ~ 50 % with METISS / PhD - 100 % with METISS: 2, 3
• + Marie-Noëlle Georgeault, administrative assistant (~ 25 %)

Probabilistic modeling of audio signals

Probabilistic modeling (1)
• 1 audio class or 1 sound object → a variety of observations
• 1 family of sounds → 1 probabilistic model
  - 1 probability density function
  - 1 likelihood function

Probabilistic modeling (2)
• Probabilistic modeling, statistical estimation, state-sequence decoding, Bayesian decision + « know-how »
• Tasks: detection, classification, verification, segmentation, …
→ Probabilistic models offer a well-understood, generic, inter-operable framework for the description and the classification of audio and speech signals
• Dominant position of Hidden Markov Models (HMM) (and variants)
• Highly competitive field in speech processing (research & industry)
• More open in audio indexing (additional factors of complexity)
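
The chain described on this slide (one probabilistic model per sound class, a likelihood function, then a Bayesian decision) can be sketched in a few lines of Python. This is only a minimal illustration, not the METISS implementation: it assumes diagonal-covariance Gaussian mixture models and equal class priors, and every function name is hypothetical.

```python
import math


def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at vector x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def gmm_log_likelihood(frames, weights, means, variances):
    """Average per-frame log-likelihood of the frames under a diagonal GMM."""
    total = 0.0
    for x in frames:
        comps = [
            math.log(w) + log_gauss_diag(x, m, v)
            for w, m, v in zip(weights, means, variances)
        ]
        top = max(comps)  # log-sum-exp for numerical stability
        total += top + math.log(sum(math.exp(c - top) for c in comps))
    return total / len(frames)


def classify(frames, models):
    """Bayesian decision with equal priors (assumption): pick the most likely class."""
    scores = {
        name: gmm_log_likelihood(frames, *params)
        for name, params in models.items()
    }
    return max(scores, key=scores.get), scores
```

Here `models` maps a class name to a `(weights, means, variances)` triple. Verification, as opposed to classification, would compare the difference of two such scores (target model versus background model) to a threshold.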

Challenges and positioning
• Generalisation to wider classes of signals with an audio component
  - multiple scales
  - multiple sources
  - multiple structures
  - multiple sensors
  - multiple levels of underlying processes
  - heterogeneous streams (audio-visual)
  - external sources of knowledge
• Robustness
  - to unseen acoustic conditions
  - to scarce training data
  - to poorly representative samples
  - to missing observations
  - to …
• Implementability
  - size
  - speed
  - scalability
  - distribution
  - etc.
• METISS positioning
  - robust training and test methods
  - compact distributed algorithms
  - versatility / migration of formalism
  - methodology and evaluation
  → speaker verification, audio segmentation, broad sound-class indexing (speech recognition)

Adaptive representations

Adaptive representations (1)
• Audio signal
  - diversity of structures (time, frequency, statistics, …)
  - superimposition of objects (notes, sources, tracks, …)
• Redundant system (dictionary of atoms): a large set of vectors with various
  - scales
  - time structures
  - frequency structures
  - phases
  - statistical properties
  - …
• Adaptive decomposition, with selection of the « best » decomposition according to a given criterion
  - sparsity
  - perception criterion
  - separability
  - conditional entropy
  - …

Adaptive representations (2)
Sparsity criteria
• Decomposition, constraint and criterion: [formulas not preserved in the transcript]
• Norm exponent = 2: quadratic norm → maximizes dispersion
• Norm exponent = 0: minimum number of non-zero coefficients → NP-complete
• Norm exponent = 1: tractable « compromise »
→ Pursuit algorithms (Matching Pursuit)
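
The pursuit idea named on this slide can be sketched as follows. This is a minimal sketch of plain Matching Pursuit over a dictionary of unit-norm atoms, written in pure Python for readability. It is not the MPTK API, and the function names are hypothetical. At each iteration the atom most correlated with the residual is selected and its contribution is subtracted.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def matching_pursuit(signal, dictionary, n_iter=10, tol=1e-10):
    """Greedy decomposition of `signal` over unit-norm atoms.

    Returns the coefficient of each atom and the final residual.
    Assumes every atom in `dictionary` has unit Euclidean norm.
    """
    residual = list(signal)
    coeffs = [0.0] * len(dictionary)
    for _ in range(n_iter):
        correlations = [dot(residual, atom) for atom in dictionary]
        k = max(range(len(dictionary)), key=lambda i: abs(correlations[i]))
        if abs(correlations[k]) < tol:
            break  # nothing left that any atom can explain
        coeffs[k] += correlations[k]
        residual = [r - correlations[k] * g for r, g in zip(residual, dictionary[k])]
    return coeffs, residual
```

With a redundant dictionary, the greedy choice does not always return the sparsest decomposition, which is why the optimality and convergence questions raised in this section matter.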

Ongoing scientific issues
• Optimality and convergence of adaptive decompositions
• Dictionary design (knowledge-based, data driven, …)
• Deformable, stochastic, multi-dimensional, … atoms
• Efficient decomposition algorithms and implementations
• Application scope
Context
• Recent fast-growing field
• High applicative potential
• Intense emerging competition
METISS positioning
- theoretical results
- concepts and methodologies
- decomposition algorithms
→ audio source separation (under-determined case)

Achievements 2002-2005 and selected results
• Speaker characterisation
• Audio structuring and indexing
• Sparse representations: theory and applications
• Audio source separation (under-determined case)

Speaker characterisation
• CART trees for scalable and distributable speaker verification
• Model-based metrics and normalisations for speaker verification
• Structural adaptation of speaker models (hierarchical Bayesian networks)
• Methodology and algorithms for optimizing the coverage of a speaker database
• Relative speaker space and metrics for efficient speaker indexing and retrieval [ongoing]

CART-based speaker verification
Blouet, Bimbot, Gonon, et al.
• Direct score function assignment → CART trees used as a family of approximating functions
• [Figure: binary decision tree with YES/NO branches and leaf scores -0.8, 0.7, 0.3, -0.4, 0.9, -0.5]
• + Extension to oblique trees
• Complexity down 200 x, error rate up 33 % only
• EU-IST INSPIRED Project
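
The idea of letting a tree assign a score directly can be sketched as follows. This is a hedged illustration only: the tree layout, feature indices and thresholds are hypothetical, the leaf values merely echo those visible in the figure, and averaging the per-frame scores before thresholding is an assumption rather than the documented procedure.

```python
def tree_score(x, node):
    """Walk a binary tree of axis-aligned threshold tests; return the leaf score."""
    while isinstance(node, dict):
        branch = "yes" if x[node["feature"]] <= node["threshold"] else "no"
        node = node[branch]
    return node


# Hypothetical tree: internal nodes test one feature, leaves hold a score.
TREE = {
    "feature": 0,
    "threshold": 0.0,
    "yes": {"feature": 1, "threshold": 1.5, "yes": -0.8, "no": 0.3},
    "no": {"feature": 1, "threshold": -0.5, "yes": -0.4, "no": 0.9},
}


def verify(frames, tree, threshold=0.0):
    """Average the per-frame leaf scores and compare to a decision threshold."""
    score = sum(tree_score(x, tree) for x in frames) / len(frames)
    return score >= threshold, score
```

Scoring a frame costs only a handful of comparisons, which is consistent with the large complexity reduction reported on the slide. An oblique tree would replace the single-feature test with a test on a linear combination of features.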

Speaker recognition in the model space (1)
Ben, Bimbot et al.
• Formal links between LLR and KL-divergence + mean-only adaptation training procedure
→ likelihood ratio test ~= Euclidean distance in the model space

Speaker recognition in the model space (2)
Ben, Bimbot et al.
• Consequences:
  - faster score computation procedure (at least -50%)
  - simpler normalization schemes (M-Norm), with no need for additional development data and no performance degradation
• Tested successfully for speaker recognition in the NIST and ESTER campaigns
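
A distance in the model space can be sketched as follows, under stated assumptions. When two Gaussian mixture models are derived from a common background model by adapting only the means, they share weights and variances, and a weighted, variance-normalised squared distance between their means is a commonly used approximation of the KL divergence between them. The exact metric used in the cited work may differ, and the function names here are hypothetical.

```python
import math


def supervector(weights, means, variances):
    """Stack the component means, each scaled by sqrt(weight) / standard deviation."""
    return [
        math.sqrt(w) * m / math.sqrt(v)
        for w, mean, var in zip(weights, means, variances)
        for m, v in zip(mean, var)
    ]


def model_distance(means_a, means_b, weights, variances):
    """Euclidean distance between two mean-adapted GMMs in the scaled model space.

    Assumes both models share `weights` and diagonal `variances`.
    """
    a = supervector(weights, means_a, variances)
    b = supervector(weights, means_b, variances)
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))
```

Comparing two models then needs only one pass over their parameters and no acoustic frames, which is in line with the faster score computation mentioned above.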

Audio indexing
• HMM-based audio and audio-visual structuring (applied to sports programmes)
• Audio segmentation and tracking using probabilistic models and statistical tests
• Detection of simultaneous events in audio tracks
• Granular models of audio signals using deformable atoms
• Comparison and evaluation of beam-search techniques and hypothesis rescoring using external sources of knowledge [ongoing]
• Algebraic representations and statistical modeling of formal music [ongoing]

Multi-stream HMM modeling (1) of a tennis match
Kijak et al. (with TMM)
• Multi-level state-sequence representation of a tennis match, inspired by and adapted from speech recognition paradigms
→ multi-stream audio-visual HMM

Multi-stream HMM modeling (2)
Delakis, Gravier et al. (with TexMex)
• Segmental models → relaxed synchrony constraints
• Video-only, shot-based: C = 77%
• Video+Audio, shot-based + segmental: C = 85%

Sparse representations
• Mathematical test for the optimality of a sparse representation
• Matching pursuit made tractable (1 hour → 0.25 x RT)
• Structured matching pursuit incorporating explicit signal family models
• Adaptive computational strategies
• Beyond sparsity: recovering structured representations…
• Learning shift-invariant atoms (MoTIF algorithms) [ongoing]

Sparse solutions to inverse linear problems
Gribonval et al.
• In the under-determined case: [formula not preserved in the transcript]
• BUT if: [condition not preserved in the transcript]
• If a sparse representation is sparse enough, then it is the sparsest one
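
The slide's own condition is not preserved, so the sketch below illustrates one classical result of this type from the sparse representation literature, as an assumption about what kind of statement is meant. For a dictionary of unit-norm atoms with coherence μ (the largest absolute inner product between two distinct atoms), any representation with fewer than (1 + 1/μ)/2 non-zero coefficients is the unique sparsest one. Function names are hypothetical.

```python
import math


def coherence(dictionary):
    """Largest absolute inner product between two distinct unit-norm atoms."""
    return max(
        abs(sum(a * b for a, b in zip(g, h)))
        for i, g in enumerate(dictionary)
        for h in dictionary[i + 1:]
    )


def sparsity_bound(dictionary):
    """Coherence-based bound (1 + 1/mu) / 2 on the number of non-zero coefficients."""
    mu = coherence(dictionary)
    if mu == 0.0:
        return math.inf  # orthogonal atoms: every representation is unique
    return 0.5 * (1.0 + 1.0 / mu)


def is_guaranteed_sparsest(coeffs, dictionary, eps=1e-12):
    """True when the representation is sparse enough for the bound to apply."""
    nonzero = sum(1 for c in coeffs if abs(c) > eps)
    return nonzero < sparsity_bound(dictionary)
```

The test is sufficient, not necessary: a representation that fails it may still be the sparsest one.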

Matching Pursuit made tractable
Gribonval, Krstulovic et al.
• MPTK: C++ toolkit, GPL licence
• Flexible operation, reproducible results
• For a 1 hour audio signal, processing time reduced from 20 h → 0.25 h
• Usable in other fields: medical signals, seismology, etc.

Source separation (with primary focus on underdetermined problems)
• Statistical schemes and adaptive training for single-channel separation
• Source separation approaches using multi-channel Matching Pursuit in the underdetermined case
• Contributions in evaluation methodology: task definition & performance measurements
• Speech « denoising » using underdetermined source separation techniques
• Dictionary design methods for source separation [ongoing]
• DEMIX: a robust algorithm to estimate the number of sources using clustering techniques [ongoing]

Single sensor audio source separation
Benaroya, Bimbot, Gribonval, Ozerov (with FTR&D)
• Use of a factorial GMM to build a time-varying Wiener filter
• [Diagram: observed signal (voice + music) → factorial GMM (voice GMM, music GMM) → Wiener filter → estimated voice signal]
• Article in IEEE Trans SAP 2006 + new results to come
• Innovative scheme for underdetermined source separation
• Compatibility with speech processing state-of-the-art
• Strong links with sparse decomposition problems
• Versatile and efficient for a range of audio description tasks
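
How a pair of source models can yield a time-varying Wiener filter is sketched below for a single short-time frame. This is a simplified illustration under stated assumptions, not the published method: each GMM component is reduced to a power spectral density (PSD), component priors are taken as uniform, and only the single most likely (voice, music) pair is kept instead of weighting all pairs by their posterior. All names are hypothetical.

```python
import math


def pair_log_likelihood(power, voice_psd, music_psd):
    """Log-likelihood (up to a constant) of an observed power spectrum when the
    mixture is zero-mean Gaussian with variance voice_psd + music_psd per bin."""
    total = 0.0
    for p, v, m in zip(power, voice_psd, music_psd):
        s = v + m
        total += -math.log(s) - p / s
    return total


def wiener_gains(power, voice_psds, music_psds):
    """Pick the most likely (voice, music) PSD pair for this frame and return
    the per-bin Wiener gain for the voice, plus the selected pair of indices."""
    pairs = [(i, j) for i in range(len(voice_psds)) for j in range(len(music_psds))]
    best = max(
        pairs,
        key=lambda ij: pair_log_likelihood(power, voice_psds[ij[0]], music_psds[ij[1]]),
    )
    v, m = voice_psds[best[0]], music_psds[best[1]]
    return [vk / (vk + mk) for vk, mk in zip(v, m)], best
```

Multiplying the complex spectrum of the frame by these gains gives the voice estimate. Because the selected pair changes from frame to frame, the filter varies over time.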

Underdetermined stereophonic source separation using sparse methods
Lesage, Gribonval et al.
• Mixing matrix → separation
• [Figure: comparison of least squares and sparsity-based separation]
• Audio examples available

Collaborations, Dissemination and Visibility
• Privileged cooperation with the TEXMEX group at IRISA (+ VISTA)
• Consistent network of academic and industrial partners outside IRISA
• Regular participation in collaborative projects (EU-IST, RNRT, bilateral partnership, …)
• Strong involvement in concerted research actions (ESTER, MathSTIC, GDR-ISIS, NIST evaluations, …)
• Visible participation in and production of free software: ELISA platform, AudioSeg, MPTK, SIROCCO, BSS-EVAL
• Sustained effort of publication and dissemination of the group's research results
• Additional visibility through responsibilities taken in scientific societies, workshop organisation and editorial boards

Summary 2002-2005
Strategy and perspectives 2006-2010

Achievements 2002-2005 (1)
• solid contributions to the state of the art on several topics related to speaker and audio class modelling and recognition
• key extension, experimentation and validation of the Hidden Markov Model framework for joint audio and video modelling and structuring
• major theoretical and experimental progress in the field of sparse representations and adaptive decomposition
• pioneering work in mono- and multi-channel source separation in the underdetermined case

Achievements 2002-2005 (2)
• strategic improvement in the efficiency of pursuit algorithms, both in terms of search strategy and implementation
• development of a usable know-how in keyword spotting and speech recognition
• sustained activities in assessment methodology, resource distribution and evaluation campaigns
• scientific objective #4 needs consolidation

Strategy 2006-2010
• To keep our position in our initial field of expertise: models, algorithms and tools for automatic processing of audio and speech signals
• To push our advantage in the field of sparse representations, from both the theoretical and the applicative viewpoint
• To extend our scope towards more powerful approaches for the representation and modeling of audio and multi-modal signals with an audio component
• To step in and progress in the area of compressing large-scale high-dimensional multi-modal data

Scientific challenges
• Probabilistic multi-level multi-stream dependency models for the representation of multiple sources and the integration of heterogeneous levels of knowledge in audio(-visual) streams → Bayesian networks
• Data-driven representations, model discovery and self-structuring of information in audio and audio-visual streams and contents → theoretical consolidation
• Experimental platforms and numerically efficient algorithms for large scale data and near real-time processing → engineering work
• Deeper understanding of the links between theoretical concepts of adaptive representation, sparse decomposition, multi-scale analysis and practical implications in terms of robustness, separability and adaptability → potential links with SVM
• Compressing large-scale high-dimensional multimodal data for storage, description and classification → compressed sensing

Questions
