
A Perspective on Speech Technology Based on Human Mechanism


Presentation Transcript


  1. A Perspective on Speech Technology Based on Human Mechanism Jianwu Dang Japan Advanced Institute of Science and Technology NCMMC-2009

  2. Signal Processing Speech communication in humans and in machines. From the 1960s, studies on speech branched into two directions: a scientific way and an engineering way. The former focuses on human functions, the latter on signal processing.

  3. Comparison of HSR and ASR • HSR is a bottom-up, divide-and-conquer strategy • Humans recognize speech based on a hierarchy of context layers • As in vision, the entropy decreases as we integrate context • Humans have an intrinsic robustness to noise and filtering • HSR: robust articulation; excellent context models; plenty of knowledge • ASR: poor articulation modeling; weak context models; little knowledge

  4. How to learn from humans

  5. Contents of this talk • Discovering and Understanding Human Functions in Speech • Human Mechanism based Learning Approach • Speaker ID by Considering Physiological Features • Articulatory Dynamics in Speech Recognition

  6. Why can humans robustly process speech? • Why can humans robustly process speech even in seriously adverse environments? Hypotheses and theories • The speech chain is constructed by co-developing the functions of speech production and perception during language acquisition. • The motor theory of speech perception states that the acoustic signal is perceived in terms of articulatory gestures. • A topological mapping between the motor space and the sensory space may be the key to the efficiency of human speech processing.

  7. Computational neural models (After Guenther 1996)

  8. Human functions in speech. [Diagram of the speech chain: intention/language → speech planning → articulation planning → motor control → articulation/phonation → aerodynamics → speech signal; feedback runs through the auditory map, auditory-phonetic mapping, somatosensory receptors, and vision, while the partner speaker's perception model and speech recognition/understanding close the loop of speech communication.]

  9. Experiment of transformed auditory feedback. Two requirements: finish all processing within 30 ms; keep all individual properties of the speaker's voice.

  10. Formant difference caused by the TAF. P0 is the time at which the TAF is applied. P1 and P3 are the start and end points of the compensation; P2 is the point of maximal compensation.
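As a rough illustration of how the landmarks P1-P3 could be located in a measured formant track, here is a minimal sketch: given a perturbation onset index and a pre-perturbation baseline, it finds where the track first deviates beyond a threshold, where the deviation peaks, and where it returns to baseline. The function name, threshold, and toy track are illustrative assumptions, not the talk's actual analysis code.

```python
import numpy as np

def compensation_points(f1, p0, baseline, thresh=5.0):
    """f1: formant track (Hz); p0: perturbation onset index (= P0);
    baseline: pre-perturbation F1 mean; thresh: deviation in Hz that
    counts as a compensatory response."""
    dev = np.abs(f1 - baseline)
    after = np.where(dev[p0:] > thresh)[0]
    if len(after) == 0:
        return None                       # no compensation detected
    p1 = p0 + after[0]                    # compensation onset (P1)
    p2 = p0 + np.argmax(dev[p0:])         # maximal compensation (P2)
    back = np.where(dev[p2:] <= thresh)[0]
    p3 = p2 + back[0] if len(back) else len(f1) - 1   # return to baseline (P3)
    return p1, p2, p3

# toy track: baseline 500 Hz, perturbation applied at index 20,
# compensation rises over indices 30-59 and decays over 60-89
f1 = np.full(100, 500.0)
f1[30:60] += np.linspace(0, 40, 30)
f1[60:90] += np.linspace(40, 0, 30)
p1, p2, p3 = compensation_points(f1, 20, 500.0)
```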

  11. Compensation for the perturbation /i/ /e/ /a/ /u/

  12. Vocal tract shape of Chinese vowels

  13. Extraction of intrinsic structure using similarity • The vocal tract shape is described by 8 points: UL, LL, LJ, T1 to T4, and the velum. • The initial vowel space thus consists of vocal tract shapes with 16 dimensions. • Similarity is measured among the vowels in the 16-dimensional space, and a similarity graph is then constructed for the vowels.
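The graph construction above can be sketched as follows: each vowel sample is a 16-dimensional vector (x, y coordinates of the 8 articulatory points), and edges connect each sample to its nearest neighbours with a Gaussian similarity weight. The choice of k-nearest-neighbour connectivity and the heat-kernel width sigma are illustrative assumptions; the talk does not specify them.

```python
import numpy as np

def similarity_graph(X, k=4, sigma=1.0):
    """Build a symmetric similarity graph over the rows of X."""
    n = len(X)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # pairwise Euclidean
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(d[i])[1:k + 1]          # skip self at position 0
        W[i, nbrs] = np.exp(-d[i, nbrs] ** 2 / (2 * sigma ** 2))
    return np.maximum(W, W.T)                      # symmetrise

X = np.random.rand(20, 16)   # 20 vowel samples in the 16-dim articulatory space
W = similarity_graph(X)
```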

  14. Distribution of articulatory place of vowels in continuous speech

  15. Similarity based analysis • The ability to assess similarity lies close to the core of cognition (Wilson et al., 1999) • Geometric models are used in the analysis of similarity • The Euclidean metric (r = 2) provides good fits to human similarity judgments

  16. Construction of intrinsic space • A neighborhood-keeping graph can be obtained by minimizing the objective function • The mapping function can be obtained by solving the corresponding generalized eigenvalue problem • The resulting eigenvectors describe the low-dimensional embedding of the original space
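The slide's equations were on the original images and are lost, but the described pipeline matches the Laplacian-eigenmaps family: the neighborhood-preserving objective (sum over edges of W_ij times the squared distance between embedded points) is minimized by the smallest generalized eigenvectors of L y = lambda D y, with D the degree matrix and L = D - W. The sketch below is a generic implementation of that idea under those assumptions, solved in the equivalent symmetric-normalized form, not the talk's exact formulation.

```python
import numpy as np

def embed(W, dim=3):
    """Low-dimensional embedding from a similarity graph W.
    Solves L y = lambda D y via the normalized Laplacian:
    with u = D^{1/2} y, the problem becomes a standard eigenproblem."""
    deg = W.sum(axis=1)
    D_inv_sqrt = np.diag(deg ** -0.5)
    L_sym = np.eye(len(W)) - D_inv_sqrt @ W @ D_inv_sqrt   # normalized Laplacian
    vals, vecs = np.linalg.eigh(L_sym)
    # skip the trivial constant eigenvector (eigenvalue ~ 0)
    return D_inv_sqrt @ vecs[:, 1:dim + 1]

rng = np.random.default_rng(0)
X = rng.random((20, 16))                       # toy 16-dim vowel samples
d = np.linalg.norm(X[:, None] - X[None], axis=-1)
W = np.exp(-d ** 2)                            # dense Gaussian similarities
Y = embed(W, dim=3)                            # 3-D intrinsic coordinates
```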

  17. Vowel structure from read speech

  18. 3D vowel structure in articulatory space

  19. Vowel Structure in Articulatory Space (12 vowels)

  20. Vowel Structure in Articulatory Space (11 vowels)

  21. Vowel Structure in Articulatory Space (with and without lip protrusion)

  22. Vowel Structure in Articulatory Space (with and without lip feature)

  23. Homunculus image of the brain

  24. Parameters for vowel structure in APS • An affine transform of a logarithmic spectrum can represent the auditory perception parameters (Wang et al., 1995) • 14-dimensional MFCCs were used as the acoustic parameters in this first step, matching the dimensionality used in the articulatory analysis. • The acoustic data were recorded simultaneously with the articulatory data; speech signals of the vowels were extracted from the stable period of each vowel.
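For concreteness, here is a minimal textbook sketch of computing a 14-dimensional MFCC vector from a single speech frame: power spectrum, triangular mel filterbank, log, then a DCT. Pre-emphasis is omitted and the filterbank is simplified; this is a generic pipeline, not the exact front end used in the talk.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(frame, sr=16000, n_mels=26, n_mfcc=14):
    # windowed power spectrum
    spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame)))) ** 2
    freqs = np.fft.rfftfreq(len(frame), 1.0 / sr)
    # triangular filters spaced evenly on the mel scale
    mels = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    edges = mel_to_hz(mels)
    fbank = np.zeros((n_mels, len(freqs)))
    for i in range(n_mels):
        lo, c, hi = edges[i], edges[i + 1], edges[i + 2]
        up = (freqs - lo) / (c - lo)
        down = (hi - freqs) / (hi - c)
        fbank[i] = np.clip(np.minimum(up, down), 0, None)
    logmel = np.log(fbank @ spec + 1e-10)
    # DCT-II to decorrelate the log filterbank energies
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_mfcc), (2 * n + 1)) / (2 * n_mels))
    return dct @ logmel

frame = np.random.randn(512)   # one 32 ms frame at 16 kHz
c = mfcc(frame)                # 14-dimensional MFCC vector
```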

  25. Vowel structure in APS

  26. 3D vowel structure in auditory space

  27. Comparison in 3D (Speaker 2)

  28. Comparison in 3D (Speaker 3)

  29. Relations between motor, sensory and articulatory spaces

  30. Contents of this talk • Discovering and Understanding Human Functions • Human Mechanism based Learning Approach • Speaker ID by Considering Physiological Features • Articulatory Dynamics in Speech Recognition

  31. Learning approaches • Distribution based learning approach (data dependent) • Performance based learning approach (case dependent) • What do we want to learn? • Human mechanism based learning approach • Model based learning

  32. Human vs. model • The goal of the low-layer optimization: learning the planned target • The goal of the high-layer optimization: learning the typical targets of phonemes and the coefficients of the carrier model [Diagram: A, speech production in humans — typical phonetic target → planning mechanism → planned target → articulation by the articulators → observed articulatory movements; B, speech production by the model — typical phonetic target → carrier model → planned target → physiological articulatory model → simulated articulatory movements.]

  33. Construction of the articulatory model based on MRI: extraction of the articulators  articulatory model

  34. Speech synthesis based on the physiological model. [Audio examples: normal speech; emphasized speech]

  35. Development of the PhAM. [Figure labeling the tongue, epiglottis, mandible, thyroid cartilage, and cricoid cartilage]

  36. Carrier model for coarticulation. [Model sketch: the phonetic target Ci is mapped to a planned target Ci′; a virtual target Vj′ lies between the vowel targets Vj and Vj+1, combined with weights α and β.] Based on this process, the planned targets are obtained by applying the carrier model to the typical articulatory targets. Related coarticulation models: perturbation model (Öhman), look-ahead model (Henke), carrier model (Dang et al.)
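The carrier-model idea — a consonant's planned target pulled from its typical phonetic target toward a virtual target carried by the surrounding vowels — can be sketched numerically. The linear blend, the averaging of the two vowel targets, and the weights alpha and beta below are illustrative assumptions, not the model's published equations.

```python
import numpy as np

def planned_target(c_target, v_prev, v_next, alpha=0.6, beta=0.4):
    """Blend a typical consonant target with a vowel-carried virtual target."""
    virtual = 0.5 * (v_prev + v_next)          # virtual target on the vowel carrier
    return alpha * c_target + beta * virtual   # coarticulated planned target

c = np.array([1.0, 2.0])       # typical consonant target (e.g. tongue-tip x, y)
v1 = np.array([0.0, 0.0])      # preceding vowel target
v2 = np.array([2.0, 0.0])      # following vowel target
p = planned_target(c, v1, v2)  # planned target shifted toward the vowel context
```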

  37. Flowchart of the low layer: the typical phonetic targets pass through the look-ahead mechanism and the carrier model to produce a calculated planned target; the physiological articulatory model (articulators such as the tongue and jaw) turns the planned target into simulated movements, which are compared with the observed articulator movements. If the difference is not yet small, the planned targets are tuned and the loop repeats; otherwise the optimal planned targets are obtained.
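The low-layer loop — adjust the planned target until the model's simulated movement matches the observation — can be sketched as a simple optimization. Here the articulatory model is stood in for by a known linear map and the tuning step is finite-difference gradient descent; both are illustrative assumptions, since the actual physiological model is far more complex.

```python
import numpy as np

def tune_planned_target(observed, simulate, x0, lr=0.1, tol=1e-6, max_iter=500):
    """Tune planned target x so that simulate(x) approaches observed movements."""
    x = x0.copy()
    for _ in range(max_iter):
        err = simulate(x) - observed
        if np.sum(err ** 2) < tol:          # "is the difference small enough?"
            break
        # finite-difference gradient of the squared error
        g = np.zeros_like(x)
        for i in range(len(x)):
            dx = np.zeros_like(x)
            dx[i] = 1e-4
            g[i] = (np.sum((simulate(x + dx) - observed) ** 2)
                    - np.sum(err ** 2)) / 1e-4
        x -= lr * g                          # tune the planned target
    return x

A = np.array([[1.0, 0.2], [0.0, 0.8]])      # toy stand-in "articulatory model"
target_true = np.array([0.5, -0.3])
observed = A @ target_true                   # observed articulator movement
x = tune_planned_target(observed, lambda t: A @ t, np.zeros(2))
```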

  38. Flowchart of the high layer: the same loop runs one level up, tuning the typical phonetic targets and the carrier-model coefficients; the learned planned targets drive the physiological articulatory model, and iteration continues until the error reaches a threshold, yielding the optimal phonetic targets and coefficients.

  39. Observation and simulation in the low layer. Cross marks: observations; diamonds: simulations. The ellipses denote the 95% confidence intervals covering the planned targets.

  40. Simulation result of vowels. Distribution of the observed and simulated articulatory movements of 5 vowels obtained via the whole framework; the blue diamonds denote the simulations.

  41. Contents of this talk • Discovering and Understanding Human Functions • Human Mechanism based Learning Approach • Investigation and Application of Individual Characteristics • Fusion of Articulatory Constraints with Speech Recognition

  42. Factors of speaker individuality. The major factors can be classified as: • Learned factors  social factors: • dialects • occupations, … • Inherent factors  physical aspects: • age, gender, … • physiological situations, … • morphology of the speech organs, …

  43. Individuality derived from morphology • The VT shape varies with articulator movement and generates distinctive phonetic information • The immovable parts of the VT give the individual information • The invariant parts of the vocal tract  the nasal cavity, the piriform fossa, and the laryngeal tube • The acoustic features induced by these parts

  44. Details in vocal tract shapes. [Figure labeling the frontal sinus, sphenoid sinus, maxillary sinus, velum, piriform fossa, laryngeal cavity, vocal folds, lips, tongue, and jaw; movable parts are shown in red.]

  45. Morphologies of the vocal tract • The nasal and paranasal cavities (Dang, et al. 1994,1996) • The piriform fossa (Dang, et al. 1997) • The laryngeal tube concerned with F4 (Takemoto, et al. 2006)

  46. Morphology effects on vowels

  47. Evaluate morphological effects. Speaker relevancy is measured using Fisher's F-ratio [Wolf, 1971]: the ratio of the between-speaker variance to the within-speaker variance of a feature x_ij, where the feature is the subband spectrum, i is the speech sample index, and j is the speaker index.
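A minimal sketch of this measure: for one subband feature, compute the variance of the per-speaker means (between-speaker) divided by the average of the per-speaker variances (within-speaker). The exact normalization in Wolf's formulation may differ; the speaker data below are synthetic.

```python
import numpy as np

def f_ratio(samples):
    """samples: list of 1-D arrays, one array of feature values per speaker."""
    grand = np.mean(np.concatenate(samples))
    means = np.array([s.mean() for s in samples])
    between = np.mean((means - grand) ** 2)        # between-speaker variance
    within = np.mean([s.var() for s in samples])   # mean within-speaker variance
    return between / within

rng = np.random.default_rng(1)
# a speaker-relevant band: subband energies differ more across speakers than within
spk_a = rng.normal(10.0, 1.0, 100)
spk_b = rng.normal(14.0, 1.0, 100)
high = f_ratio([spk_a, spk_b])
# a speaker-irrelevant band: both speakers share the same distribution
low = f_ratio([rng.normal(10, 1, 100), rng.normal(10, 1, 100)])
```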

  48. Discriminative score based on F-Ratio • The speaker-relevant frequencies are almost invariant across the five speech sessions. • Low frequency region from 50 Hz to 300 Hz  glottis • High frequency region from 4 kHz to 5.5 kHz  piriform fossa • High frequency region from 6.5 kHz to 7.8 kHz  consonants • Middle frequency region  linguistic information

  49. How to design an algorithm • Enhance the information around the speaker-relevant frequency regions • Two ways: • increase the amplitude of the region • increase the resolution of the region • What do humans do? They increase the resolution  design a non-uniform frequency-warping algorithm to emphasize the speaker-relevant frequency regions
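One generic way to realize such non-uniform warping is to place filter-channel edges at equal increments of a weighted cumulative density, so that up-weighted bands receive more channels (finer resolution). The band edges below echo the F-ratio findings on the previous slide, but the weights and the CDF construction are illustrative assumptions, not the talk's actual algorithm.

```python
import numpy as np

def warp_edges(n_channels=30, sr=16000):
    """Channel edges of a non-uniform filterbank that densifies
    speaker-relevant regions."""
    freqs = np.linspace(0, sr / 2, 1000)
    weight = np.ones_like(freqs)
    weight[(freqs >= 50) & (freqs <= 300)] = 4.0       # glottis region
    weight[(freqs >= 4000) & (freqs <= 5500)] = 4.0    # piriform fossa region
    cdf = np.cumsum(weight)
    cdf /= cdf[-1]
    # equal steps in the weighted CDF -> more edges where the weight is high
    return np.interp(np.linspace(0, 1, n_channels + 1), cdf, freqs)

edges = warp_edges()   # 31 band edges for 30 channels
```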

  50. Comparison of frequency resolutions • Uniform: linear frequency scale (no frequency warping) • Mel: Mel frequency scale (Mel frequency warping) • Non-uniform: non-uniform frequency scale (non-uniform frequency warping)  the speaker-individual features are emphasized
