Biologically Inspired Noise-Robust Speech Recognition for Both Man and Machine
Mark D. Skowronski
Computational NeuroEngineering Lab, University of Florida
March 26, 2004
Speech Recognition Motivation
Speech is the #1 real-time communication medium among humans.
Advantages of a voice interface to machines:
• Hands-free operation
• Speed
• Ease of use
Man vs. Machine
Man is a high-performance existence proof for speech processing in noisy environments. Can we emulate man’s performance by building expert information into our systems?
[Figure: Wall Street Journal/Broadcast News readings, 5000 words; untrained human listeners vs. Cambridge HTK LVCSR system]
Biologically Inspired Algorithms
Expert information is added in three applications:
• Speech enhancement for human listeners
• Feature extraction for automatic speech recognition
• Classification for automatic speech recognition
Speech Enhancement
Biology: Lombard effect
Motivations:
• Noisy cell phone conversations
• Public address systems
• Aircraft cockpit
What can we do to increase intelligibility when turning up the volume is not an option?
This work was funded by the iDEN Technology Group of Motorola.
The Lombard Effect
Psychophysical changes in vocal characteristics, produced by a speaker in the presence of background acoustic noise:
• Vocal effort (amplitude) increases
• Duration increases
• Pitch increases
• Formant frequencies increase
• Energy center of gravity increases
• Consonant-to-noise ratio increases
Result: Intelligibility increases
Psychoacoustic Experiments
Miller and Nicely (1955): AWGN added to speech affects place of articulation and frication most, and voicing and nasality less.
Furui (1986): Truncated vowels in consonant-vowel pairs dramatically decreased in intelligibility beyond a certain point of truncation. These points correspond to spectrally dynamic regions.
Bottom line: Speech contains regions of relatively high phonetic information, and emphasis of these regions increases intelligibility.
Solution: Energy Redistribution
We redistribute energy from regions of low information content to regions of high information content while conserving overall energy.
From Miller and Nicely: ER for Voiced/Unvoiced (ERVU) regions.
Voicing is determined by the Spectral Flatness Measure (SFM). X_j(k) is the magnitude of the short-term Fourier transform of the jth speech window of length N.
[Figure: SFM of “clarification”]
M. D. Skowronski, J. G. Harris, and T. Reinke, J. Acoust. Soc. Am., 2002
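The SFM equation itself did not survive on this slide. As a hedged illustration only, the sketch below uses the common textbook definition of spectral flatness: the geometric mean of the magnitude spectrum divided by its arithmetic mean. Whether the talk applied it to magnitudes or powers, and what voicing threshold it used, are not stated, so the function names, the floor value, and the test signals are all assumptions.

```python
import cmath
import math

def magnitude_spectrum(window):
    """Naive DFT magnitude |X(k)| for k = 0..N-1 (adequate for a short demo window)."""
    n = len(window)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(window)))
        for k in range(n)
    ]

def spectral_flatness(mags, floor=1e-12):
    """Geometric mean over arithmetic mean of the magnitudes; result lies in (0, 1]."""
    mags = [max(m, floor) for m in mags]  # assumed floor, avoids log(0)
    log_geo_mean = sum(math.log(m) for m in mags) / len(mags)
    arith_mean = sum(mags) / len(mags)
    return math.exp(log_geo_mean) / arith_mean

n = 64
# Sinusoid: energy concentrated in two bins, a stand-in for a voiced (peaky) spectrum.
tone = [math.sin(2 * math.pi * 4 * i / n) for i in range(n)]
# Unit impulse: perfectly flat spectrum, the noise-like extreme.
impulse = [1.0] + [0.0] * (n - 1)

sfm_tone = spectral_flatness(magnitude_spectrum(tone))
sfm_flat = spectral_flatness(magnitude_spectrum(impulse))
```

A low SFM indicates a peaky, voiced-like window and a value near 1 indicates a flat, unvoiced-like window, which is the distinction a voiced/unvoiced decision would rely on.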
Listening Tests
Confusable set test, from Junqua*
• 500 trials, forced decision
• 3 algorithms (control, ERVU, HPF)
• 0 dB and -10 dB SNR, AWGN
• Unlimited playback over headphones
• 25 participants, 30-45 minutes
*J. C. Junqua, J. Acoust. Soc. Am., 1993
Listening Tests Results
-10 dB SNR, white noise. Errors decreased 20% compared to control.
[Figure panels: “S”, “A”, “E”, “M”]
Energy Redistribution Summary
• Developed a real-time algorithm for cell phone applications using biological inspiration,
• Increased intelligibility while maintaining naturalness and conserving energy,
• Effective because everyday speech is not clearly enunciated,
• ERVU is a novel approach to speech enhancement that works on either clean speech or noise-reduced speech.
M. D. Skowronski and J. G. Harris, J. Acoust. Soc. Am., 2004b (in preparation)
Feature Extraction
ASR stages: Input, Feature Extraction, Classification
Information: phonetic, gender, age, emotion, pitch, accent, physical state, additive/channel noise.
[Figure: HFCC filter bank]
Existing Algorithms
Goal: emphasize phonetic information over other information streams.
Feature algorithms:
• Acoustic: formant frequencies, bandwidths
• Model based: linear prediction
• Filter bank based: mel frequency cepstral coefficients (MFCC)
Provides dimensionality reduction on quasi-stationary windows.
[Figure: features vs. time for “seven”]
MFCC Filter Bank
• Design parameters: filter bank frequency range, number of filters
• Center frequencies equally spaced in mel frequency
• Triangle endpoints set by center frequencies of adjacent filters
Although filter spacing is determined by the perceptual mel frequency scale, bandwidth is set more for convenience than by biological arguments.
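The three design rules above can be sketched as follows. This is a minimal illustration, not the filter bank used in the experiments: the mel mapping 2595·log10(1 + f/700) is one common choice assumed here, and the function names, the 0 to 4000 Hz range, and the filter count are hypothetical.

```python
import math

def hz_to_mel(f_hz):
    # Assumed mel mapping; other variants exist.
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc_triangles(f_low_hz, f_high_hz, num_filters):
    """Return (lower, center, upper) frequencies in Hz for each triangular filter."""
    m_low, m_high = hz_to_mel(f_low_hz), hz_to_mel(f_high_hz)
    step = (m_high - m_low) / (num_filters + 1)
    # Points equally spaced in mel; each filter's endpoints are its neighbors' centers.
    points = [mel_to_hz(m_low + i * step) for i in range(num_filters + 2)]
    return [(points[i], points[i + 1], points[i + 2]) for i in range(num_filters)]

def triangle_gain(f_hz, lower, center, upper):
    """Gain of one triangular filter at frequency f_hz (peak of 1.0 at the center)."""
    if lower < f_hz <= center:
        return (f_hz - lower) / (center - lower)
    if center < f_hz < upper:
        return (upper - f_hz) / (upper - center)
    return 0.0

filters = mfcc_triangles(0.0, 4000.0, 10)
widths = [upper - lower for lower, _, upper in filters]
```

Because the endpoints are tied to the neighboring centers, the widths follow entirely from the spacing and the number of filters. That coupling is what the slide describes as bandwidth being set for convenience.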
HFCC Filter Bank
HFCC: human factor cepstral coefficients
• Decouples filter bandwidth from filter spacing,
• Sets filter width according to the critical bandwidth of the human auditory system,
• Uses the Moore and Glasberg approximation of critical bandwidth, expressed as Equivalent Rectangular Bandwidth (ERB).
f_c is the critical band center frequency (kHz).
M. D. Skowronski and J. G. Harris, ICASSP, 2002
HFCC with E-factor
Linear ERB scale factor (E-factor) controls filter bandwidth:
• Controls tradeoff between local SNR and spectral resolution,
• Exemplifies the benefits of decoupling filter bandwidth from filter spacing.
[Figure panels: E-factor = 1, E-factor = 3]
M. D. Skowronski and J. G. Harris, J. Acoust. Soc. Am., 2004a (submitted)
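The ERB expression is not reproduced on the slides. The sketch below assumes the widely quoted Moore and Glasberg polynomial, ERB = 6.23 f_c² + 93.39 f_c + 28.52 Hz with f_c in kHz, and treats the E-factor as a plain multiplier on that bandwidth. The function name and the sample center frequencies are hypothetical.

```python
def erb_hz(fc_khz, e_factor=1.0):
    """Target filter bandwidth in Hz at center frequency fc_khz (kHz).

    Assumes the Moore and Glasberg polynomial approximation of ERB,
    scaled linearly by the E-factor.
    """
    return e_factor * (6.23 * fc_khz ** 2 + 93.39 * fc_khz + 28.52)

centers_khz = [0.25, 0.5, 1.0, 2.0, 4.0]
erb_e1 = [erb_hz(fc) for fc in centers_khz]                 # E-factor = 1
erb_e3 = [erb_hz(fc, e_factor=3.0) for fc in centers_khz]   # E-factor = 3
```

Under these assumptions the bandwidth at 1 kHz is about 128 Hz for E-factor = 1 and three times that for E-factor = 3. Wider filters average over more spectrum, which is the local SNR versus spectral resolution tradeoff named above.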
ASR Experiments
• Isolated English digits “zero” through “nine” from TI-46 corpus, 8 male speakers,
• HMM word models, 8 states per model, diagonal covariance matrix,
• Three MFCC versions (different filter banks),
• Linear ERB scale factor (E-factor),
• HFCC with E-factor (HFCC-E).
Total: 37.9 million frames of speech (>100 hours)
ASR Results
[Figure: white noise (global SNR); HFCC-E vs. D&M; linear ERB scale factor (E-factor)]
M. D. Skowronski and J. G. Harris, ISCAS, 2003
HFCC Summary
• Adds biologically inspired bandwidth to the filter bank of a popular speech feature extractor,
• Provides superior noise-robust performance over MFCC and variants,
• Allows for further filter bank design modifications, demonstrated by HFCC with E-factor,
• HFCC has the same computational cost as MFCC; only the filter bank coefficients are adjusted, so it is easy to implement.
Classification
• HMM Limitations & Variations
• Freeman Model Introduction
• Model Hierarchy
• Associative Memory
• ASR Experiments
[Figure: Freeman’s Reduced KII Network]
This work was funded by the Office of Naval Research, grant N00014-1-1-0405.
HMM Limitations & Variations
Limitations:
• HMM is piece-wise stationary; speech is nonstationary,
• Assumes frames are i.i.d.; speech is coarticulated,
• State PDFs are data-driven; curse of dimensionality.
Variations:
• Deng (1992): trended HMM
• Rabiner (1986): autoregressive HMM
• Morgan & Bourlard (1995): HMM/MLP hybrid
• Robinson (1994): context-dependent RNN
• Herrmann (1993): transient attractor network
• Liaw & Berger (1996): dynamic synapse
[Diagram labels: RNN, ~HMM, Nonlinear Dynamic]
Freeman (1997): non-convergent dynamic biological model
Freeman Model
Hierarchical nonlinear dynamic model of cortical signal processing from rabbit olfactory neo-cortex.
• K0 cell: H(s), a 2nd-order low-pass filter
• Reduced KII (RKII) cell: stable oscillator
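To simulate a 2nd-order low-pass H(s) in discrete time, one option is impulse invariance, which makes the discrete impulse response equal to the sampled continuous one. The sketch below is a hedged illustration: it assumes the two-real-pole form H(s) = ab / ((s + a)(s + b)), which the slide does not spell out, and the rate constants a = 220/s and b = 720/s and the 1 ms step are illustrative assumptions rather than values taken from the talk.

```python
import math

def impulse_invariant_coeffs(a, b, dt):
    """Discretize H(s) = a*b / ((s + a)(s + b)) so that h[n] = dt * h(n*dt)."""
    pa, pb = math.exp(-a * dt), math.exp(-b * dt)   # discrete-time poles
    b1 = dt * a * b / (b - a) * (pa - pb)           # numerator coefficient on z^-1
    a1, a2 = -(pa + pb), pa * pb                    # denominator: 1 + a1 z^-1 + a2 z^-2
    return b1, a1, a2

def filter_response(x, b1, a1, a2):
    """Run y[n] = b1*x[n-1] - a1*y[n-1] - a2*y[n-2] with zero initial conditions."""
    y = []
    for n in range(len(x)):
        x1 = x[n - 1] if n >= 1 else 0.0
        y1 = y[n - 1] if n >= 1 else 0.0
        y2 = y[n - 2] if n >= 2 else 0.0
        y.append(b1 * x1 - a1 * y1 - a2 * y2)
    return y

def continuous_impulse_response(t, a, b):
    """h(t) of the assumed two-pole filter, for t >= 0."""
    return a * b / (b - a) * (math.exp(-a * t) - math.exp(-b * t))

a, b, dt = 220.0, 720.0, 1e-3   # assumed rate constants (1/s) and time step (s)
coeffs = impulse_invariant_coeffs(a, b, dt)
unit_impulse = [1.0] + [0.0] * 49
h_discrete = filter_response(unit_impulse, *coeffs)
h_sampled = [dt * continuous_impulse_response(n * dt, a, b) for n in range(50)]
max_error = max(abs(d - s) for d, s in zip(h_discrete, h_sampled))
```

The discrete response matches the sampled continuous response to rounding error. That match is the defining property of the method, and a single second-order recursion per cell is cheap enough to scale to large networks.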
RKII Network
High-dimensional, scalable network of stable oscillators. Fully connected M-cell and G-cell weight matrices (zero diagonal).
Capable of several dynamic behaviors:
• Stable attractors (limit cycle, fixed point)
• Chaos
• Spatio-temporal patterns
• Synchronization
[Diagram labels: Generalization, Associative Memory]
Oscillator Network
Two regimes of operation as an associative memory of binary patterns:
• Energy
• Synchronization Through Stimulation (STS)
Network weights for each regime are set by a variation of the outer product rule and by hand.
M. D. Skowronski and J. G. Harris, Phys. Rev. E, 2004 (in preparation)
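The slides do not give the variation of the outer product rule used for the oscillator network. For orientation only, the sketch below shows the textbook outer product (Hebbian) rule in a static Hopfield-style memory with threshold units. It is not the RKII network, and every name and pattern in it is hypothetical.

```python
def outer_product_weights(patterns):
    """Sum of outer products of bipolar (+1/-1) patterns, with a zero diagonal."""
    n = len(patterns[0])
    w = [[0.0] * n for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:
                    w[i][j] += p[i] * p[j] / n
    return w

def recall(w, state, sweeps=5):
    """Asynchronous threshold updates; returns the state after the given sweeps."""
    state = list(state)
    for _ in range(sweeps):
        for i in range(len(state)):
            h = sum(w[i][j] * state[j] for j in range(len(state)))
            state[i] = 1 if h >= 0 else -1
    return state

stored = [
    [1, 1, 1, 1, -1, -1, -1, -1],
    [1, -1, 1, -1, 1, -1, 1, -1],
]
weights = outer_product_weights(stored)
noisy = [-1, 1, 1, 1, -1, -1, -1, -1]   # first stored pattern with element 0 flipped
recovered = recall(weights, noisy)
```

Here the two stored patterns are orthogonal, so a one-element error is corrected and both patterns are fixed points. Correlated patterns are where spurious outputs start to appear in this kind of memory.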
Associative Memory
[Figure: input and output patterns for four cases: full, partial, noisy, spurious]
ASR with RKII Network
Two-Class Case
• \IY\ from “she”
• \AA\ from “dark”
• 10 HFCC-E coefficients converted to binary
• Energy-based RKII associative memory
• No overlap between learned centroids
ASR with RKII Network
Three-Class Case
• \IY\ from “she”
• \AA\ from “dark”
• \AE\ from “ask”
• 18 HFCC-E coefficients converted to binary
• Energy-based RKII associative memory
• Variable overlap between learned centroids
Overlap is controlled by the binary feature conversion. More overlap leads to more spurious outputs.
Freeman Model Summary
Contributions:
• Documented impulse invariance discretization,
• Developed software tools, enabling large-scale experiments,
• Demonstrated stable attractors in Freeman model,
• Explained attractor instability by transient chaos,
• Proposed two regimes of associative memory,
• Invented novel synchronization mechanism (STS),
• Devised variation of outer product rule for oscillator network learning rule,
• Proved practical probabilities concerning overlap in three-class case,
• Applied novel static pattern classifier to ASR.
Conclusions
• Developed novel speech enhancement algorithm:
  • Lombard effect indicates what to modify,
  • Psychoacoustic experiments indicate where to modify,
  • ERVU reduces human recognition error 20-40% in noisy environments.
• Extended existing speech feature extraction algorithm:
  • Critical bandwidth used to decouple filter bandwidth and spacing,
  • HFCC-E demonstrates a research tangent for independent filter bandwidth,
  • HFCC-E improves ASR by 7 dB SNR.
• Advanced knowledge of nonlinear dynamic (NLD) models for information processing:
  • Applied model to ASR of static speech features,
  • Near-optimum performance of RKII network associative memory using first-order statistics.