Human Speech Communication

message → linguistic code (~50 b/s) → motor control → speech production → SPEECH SIGNAL (~50 kb/s) → speech perception → cognitive processes → linguistic code (~50 b/s) → message

PCM (Pulse Code Modulation)
• Transmit value of each speech sample
• dynamic range of speech is about 50-60 dB
  • 11 bits/sample
• maximum frequency in telephone speech is 3.4 kHz
  • sampling frequency 8 kHz
• 8000 x 11 = 88 kb/s
Simple and universal but not very efficient
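The arithmetic on this slide can be checked in a few lines. This is only a sketch; the function name `pcm_bit_rate` is hypothetical, and the numbers are the ones quoted above.

```python
def pcm_bit_rate(sampling_rate_hz: int, bits_per_sample: int) -> int:
    """Bit rate of plain PCM in bits per second."""
    return sampling_rate_hz * bits_per_sample

# Values from the slide: telephone speech up to 3.4 kHz, sampled at 8 kHz,
# 11 bits per sample.
max_frequency_hz = 3400
sampling_rate_hz = 8000

# The sampling rate must exceed twice the highest frequency (Nyquist).
assert sampling_rate_hz > 2 * max_frequency_hz

rate = pcm_bit_rate(sampling_rate_hz, 11)
print(f"{rate / 1000:.0f} kb/s")  # 88 kb/s
```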

Better quantization?
[Figure: quantizer characteristic, IN vs. OUT]
• Less quantization noise for weaker signals

Logarithmic PCM (μ-law, A-law)
[Figure: μ-law and A-law characteristics]
• Finer quantization for each individual small amplitude sample
• how about small signal samples surrounded by large ones?
• it is the instantaneous signal energy which should determine the step
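A minimal sketch of μ-law companding shows why logarithmic PCM helps weak signals. The value μ = 255, the 8-bit quantizer, and the test tone are assumptions chosen for illustration; none of them is given on the slide.

```python
import numpy as np

MU = 255.0  # assumed value; the slide does not give one

def mu_law_compress(x, mu=MU):
    """Map samples in [-1, 1] to [-1, 1] with logarithmic compression."""
    x = np.asarray(x, dtype=float)
    return np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)

def mu_law_expand(y, mu=MU):
    """Inverse of mu_law_compress."""
    y = np.asarray(y, dtype=float)
    return np.sign(y) * np.expm1(np.abs(y) * np.log1p(mu)) / mu

def quantize_uniform(v, bits):
    """Uniform quantizer on [-1, 1] with 2**bits steps."""
    step = 2.0 / (2 ** bits)
    return np.clip(np.round(v / step) * step, -1.0, 1.0)

# A weak signal: amplitude is 1% of full scale.
t = np.arange(800) / 8000.0
weak = 0.01 * np.sin(2 * np.pi * 440 * t)

bits = 8
uniform_out = quantize_uniform(weak, bits)
log_out = mu_law_expand(quantize_uniform(mu_law_compress(weak), bits))

err_uniform = np.mean((weak - uniform_out) ** 2)
err_log = np.mean((weak - log_out) ** 2)
print("uniform PCM error:", err_uniform)
print("log PCM error:    ", err_log)
```

With the same number of bits, the companded version spends more of its levels near zero, so the weak tone is reproduced with less quantization noise.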

Differential coding
[Figure: waveform over time, with the current sample marked]
• For many natural signals, the difference between successive samples quantizes better than the samples themselves
• Even better, predict the current sample from the past ones and transmit the error of the prediction

Differential predictive coding
• DPCM
  • a single predictor reflecting global predictability of speech
  • predictor order up to 4-5
  • delta modulation - gross quantization of prediction error into 1 bit (typically requires up-sampling well over the Nyquist rate)
• adaptive DPCM
  • new predictor for every new speech block
  • predictor needs to be transmitted together with the prediction error
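The DPCM idea can be sketched with a fixed first-order predictor. The predictor coefficient `a = 0.9`, the step size, and the test tone are illustrative assumptions; a real coder would use a higher-order predictor as noted above.

```python
import numpy as np

def dpcm_encode(x, a=0.9, step=0.05):
    """Return quantized prediction-error codes for signal x.

    The predictor uses the previously *reconstructed* sample, so the
    encoder tracks exactly what the decoder will compute.
    """
    codes = np.empty(len(x), dtype=int)
    prev = 0.0
    for n, sample in enumerate(x):
        pred = a * prev
        codes[n] = int(np.round((sample - pred) / step))
        prev = pred + codes[n] * step
    return codes

def dpcm_decode(codes, a=0.9, step=0.05):
    """Rebuild the signal from the prediction-error codes."""
    out = np.empty(len(codes))
    prev = 0.0
    for n, c in enumerate(codes):
        prev = a * prev + c * step
        out[n] = prev
    return out

t = np.arange(400) / 8000.0
x = 0.8 * np.sin(2 * np.pi * 200 * t)   # slowly varying test signal

codes = dpcm_encode(x)
y = dpcm_decode(codes)

direct = np.round(x / 0.05).astype(int)  # codes if samples were sent directly
print("largest direct code:", np.max(np.abs(direct)))
print("largest DPCM code:  ", np.max(np.abs(codes)))
```

The prediction error spans far fewer quantizer levels than the samples themselves, which is where the bit-rate saving comes from.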

Speech Coders

Linear model of speech production

A.G. Bell got it almost right

linear model of speech: source → filter → speech
• changes slowly

[Figure: waveform over time, showing short-term and long-term prediction of the current sample]
• short-term prediction - resonance of vocal tract
• long-term prediction - periodicity of voiced speech (vocal cord vibration)
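Short-term prediction can be sketched with the autocorrelation method. The function names are hypothetical, and the test signal is a synthetic second-order resonance standing in for a vocal-tract filter. Long-term (pitch) prediction is not covered here.

```python
import numpy as np

def lpc_coefficients(x, order):
    """Estimate a[1..order] so that x[n] ~ sum_k a[k] * x[n-k]."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    # Autocorrelation at lags 0..order.
    r = np.array([np.dot(x[: n - k], x[k:]) for k in range(order + 1)])
    # Toeplitz matrix of autocorrelations, R[i, j] = r[|i - j|].
    idx = np.abs(np.subtract.outer(np.arange(order), np.arange(order)))
    return np.linalg.solve(r[idx], r[1:])

def prediction_error(x, a):
    """Residual left after subtracting the linear prediction."""
    x = np.asarray(x, dtype=float)
    e = x.copy()
    for k in range(1, len(a) + 1):
        e[k:] -= a[k - 1] * x[:-k]
    return e

# Synthetic "speech": white noise passed through a known 2nd-order resonator.
rng = np.random.default_rng(0)
noise = rng.standard_normal(20000)
x = np.zeros_like(noise)
for n in range(2, len(x)):
    x[n] = 1.3 * x[n - 1] - 0.6 * x[n - 2] + noise[n]

a = lpc_coefficients(x, order=2)
e = prediction_error(x, a)
print("estimated predictor:", a)
print("signal variance:", np.var(x), "residual variance:", np.var(e))
```

The residual has much lower variance than the signal, so it is cheaper to quantize; the predictor itself describes the resonance.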

LPC vocoder
• The same principle as in H. Dudley’s Vocoder
• Used by US Government (LPC-10) - 2.4 kb/s

Residual Excited LPC (RELP)
• Transmitter:
  • Simplify prediction error (low-pass filter and down-sample)
• Receiver:
  • re-introduce high frequencies in the simplified residual (nonlinear distortion)

Analysis-by-synthesis
• Identical synthesizer in coder and in decoder
• change parameters in coder
• use for synthesizing speech
• compare synthesized speech with real speech
• when “close enough”, send parameters to the receiver
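The loop described above can be sketched with a toy synthesizer. Everything here is an assumption made for illustration: the synthesizer is a single sinusoid, the parameters are a frequency and a gain, and "close enough" is a mean-squared-error threshold.

```python
import numpy as np

FS = 8000  # sampling rate in Hz (assumed)

def synthesize(freq_hz, gain, n=160):
    """Toy synthesizer shared by coder and decoder."""
    t = np.arange(n) / FS
    return gain * np.sin(2 * np.pi * freq_hz * t)

def analyze_by_synthesis(target, freqs, gains, tol=1e-3):
    """Try candidate parameters until the synthetic signal is close enough."""
    best_params, best_err = None, np.inf
    for f in freqs:
        for g in gains:
            err = np.mean((target - synthesize(f, g, len(target))) ** 2)
            if err < best_err:
                best_params, best_err = (f, g), err
            if err < tol:  # "close enough": stop and send these parameters
                return best_params
    return best_params  # otherwise send the best candidate found

real_speech = synthesize(300, 0.5)  # stand-in for one frame of real speech
params = analyze_by_synthesis(real_speech,
                              freqs=range(100, 501, 50),
                              gains=[0.25, 0.5, 0.75, 1.0])
decoded = synthesize(*params)  # the receiver runs the same synthesizer
print(params)
```

Only `params` crosses the channel; the decoder reproduces the frame because it runs the identical synthesizer.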

Future in speech coding?
• No need to transmit what we do not hear
  • study human hearing, especially masking
• No need to transmit what is predictable
  • speech production mechanism
  • speaker characteristics
  • linguistic code (recognition-synthesis)
  • thought-to-speech

Automatic recognition of speech
[Diagram: electric signal (more than 50 kb/s) → reduce information (= decrease entropy) using knowledge, both prior knowledge (textbook) and acquired knowledge (data) → phoneme string (below 50 b/s) → linguistic message]
• Automatic speech recognition (ASR)
  • derive proper response from speech stimulus
• Auditory perception
  • how do biological systems respond to acoustic stimuli
• Knowledge of auditory perception?

Principle of stochastic ASR
• Using a model of speech production process, generate all possible acoustic sequences w_i for all legal linguistic messages
• Compare all generated sequences with the unknown acoustic input x to find which one is the most similar
• What is the model M(w_i)?
• Form of the data x?

One (simple) model
[Figure: “hello world” modeled as a chain of units: h e l o w o r l d u]
• Two dominant sources of variability in speech
  • people say the same thing at different speeds (temporal variability)
  • different people sound different, communication environments differ (feature variability)
• “Doubly stochastic” process (Hidden Markov Model)
• Speech as a sequence of hidden states - recover the state sequence
  • never know for sure in which state we are
  • never know for sure which data can be generated from a given state

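Hidden Markov Model
[Figure: ten utterances of “hi” from a hidden sequence of male (m) and female (f) speakers: m f m m m m m f m m. The model: states m and f, transition probabilities pm-f and pf-m (also labeled: p1m), and distributions P(sound|gender), pm and pf, over f0.]
Observed f0 = 160 Hz, 170 Hz, 160 Hz, 170 Hz, 200 Hz, 110 Hz, 140 Hz, 240 Hz, 170 Hz, 190 Hz
sequence of male and female groups?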

[Figure: observations x (160 170 160 170 200 110 140 240 170 190) aligned with hidden states f and m; in speech, the hidden states are units of speech (phonemes)]
What should the x be?
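The male/female example can be decoded with the Viterbi algorithm. The f0 values are the ones on the slide; every model parameter below (Gaussian means and spreads, initial and transition probabilities) is an illustrative assumption, since the slides do not give numbers for them.

```python
import numpy as np

STATES = ["m", "f"]
F0 = np.array([160, 170, 160, 170, 200, 110, 140, 240, 170, 190], dtype=float)

# All numbers below are illustrative assumptions, not values from the slides.
MEAN = np.array([120.0, 220.0])   # mean f0 of male / female voices (Hz)
STD = np.array([40.0, 40.0])      # spread of f0 within each group (Hz)
LOG_INIT = np.log([0.5, 0.5])     # initial state probabilities
LOG_TRANS = np.log([[0.7, 0.3],   # P(next state | current = m)
                    [0.3, 0.7]])  # P(next state | current = f)

def log_emission(f0):
    """Log of Gaussian P(f0 | gender) for both states, shape (T, 2)."""
    z = (f0[:, None] - MEAN) / STD
    return -0.5 * z ** 2 - np.log(STD * np.sqrt(2 * np.pi))

def viterbi(f0):
    """Most likely hidden state sequence for the observed f0 values."""
    log_b = log_emission(f0)
    T, S = log_b.shape
    score = np.empty((T, S))
    back = np.zeros((T, S), dtype=int)
    score[0] = LOG_INIT + log_b[0]
    for t in range(1, T):
        cand = score[t - 1][:, None] + LOG_TRANS  # cand[i, j]: from i to j
        back[t] = np.argmax(cand, axis=0)
        score[t] = cand[back[t], np.arange(S)] + log_b[t]
    # Trace the best path backwards from the best final state.
    path = [int(np.argmax(score[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return [STATES[s] for s in reversed(path)]

path = viterbi(F0)
print(" ".join(path))
```

Values near the middle (160-200 Hz) are ambiguous on their own; the transition probabilities decide whether they join the neighboring male or female group. Different assumed parameters would give a different segmentation.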

Speech signal?
• Reflects changes in acoustic pressure
• its original purpose is reconstruction of speech
• does carry relevant information
• always also carries some irrelevant information
• additional processing is necessary to alleviate it

[Figure: speech signal, with its histogram and correlations]

Where Is The Message?
[Figure: Isaac Newton, λ/4, beer: /u/ /o/ /a/ /e/ /iy/]
[Figure: averaged FFT spectra of some vowels (/uw/ /ao/ /ah/ /eh/ /ih/ /iy/) from 3 hours of fluent speech]
• it is in the spectrum!

Inertia in engineering
[Figure: Steam Engine (1769) and Internal Combustion Engine (2003)]

Short-term Spectrum
[Figure: get spectral components from 10-20 ms segments of the signal; time-frequency plot of /j/ /u/ /ar/ /j/ /o/ /j/ /o/]

[Figure: short-term speech spectral envelope, with its histogram and correlations]

[Figure: logarithmic short-term speech spectral envelope, with its histogram and correlations]

[Figure: cosine transform of logarithmic short-term speech spectral envelope (cepstrum), with its histogram and correlations]
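The chain of representations above (signal, short-term spectrum, logarithm, cosine transform) can be sketched as follows. This is a simplification: it uses the raw log power spectrum of one frame rather than a smoothed spectral envelope. The frame length, window, and number of coefficients are assumptions.

```python
import numpy as np

def dct_ii(v):
    """Unnormalized DCT-II of a 1-D array (cosine transform)."""
    n = len(v)
    k = np.arange(n)[:, None]
    m = np.arange(n)[None, :]
    return np.cos(np.pi * k * (2 * m + 1) / (2 * n)) @ v

def cepstrum(frame, n_coeffs=13):
    """Cosine transform of the log power spectrum of one windowed frame."""
    windowed = frame * np.hamming(len(frame))
    power = np.abs(np.fft.rfft(windowed)) ** 2
    log_spectrum = np.log(power + 1e-12)  # small floor avoids log(0)
    return dct_ii(log_spectrum)[:n_coeffs]

fs = 8000
t = np.arange(160) / fs  # one 20 ms frame
frame = np.sin(2 * np.pi * 500 * t) + 0.5 * np.sin(2 * np.pi * 1500 * t)

c = cepstrum(frame)
print(c.shape)  # (13,)
```

Keeping only the low-order coefficients discards fine spectral detail, which is one simple way to approximate an envelope.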

What Is Wrong With the Short-term Spectrum? 1) Inconsistent (same message, different representation)
[Figure: short-term spectrum over frequency → auditory-like modifications → “auditory-like” spectrum]

Frequency resolution of human ear decreases with frequency
[Figure: pitch of the tone (Mel scale)]
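One widely used analytic approximation of the Mel scale is sketched below. The constants 2595 and 700 come from that common formula and are an assumption here; the slide shows only the curve.

```python
import math

def hz_to_mel(f_hz: float) -> float:
    """Approximate perceived pitch in mel for a tone at f_hz."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

for f in (100, 500, 1000, 2000, 4000):
    print(f, "Hz ->", round(hz_to_mel(f)), "mel")

# The same 100-mel step covers more Hz at high frequencies than at low ones.
low_step = mel_to_hz(200) - mel_to_hz(100)
high_step = mel_to_hz(2100) - mel_to_hz(2000)
print(low_step < high_step)
```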

Emulating frequency resolution of human ear with FFT
[Figure: signal over time t → FFT → spectrum S over frequency f → “critical-band energy”]
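A rough sketch of the idea: sum FFT power within bands that get wider as frequency increases. The band edges here are spaced equally on a mel-like warped axis and the bands are rectangular and non-overlapping; both choices, and the band count, are simplifying assumptions rather than the method on the slide.

```python
import numpy as np

def band_edges_hz(n_bands, fs):
    """Band edges equally spaced on a mel-like warped axis (assumed formula)."""
    mel_max = 2595.0 * np.log10(1.0 + (fs / 2) / 700.0)
    mel = np.linspace(0.0, mel_max, n_bands + 1)
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def band_energies(frame, fs, n_bands=15):
    """Sum FFT power within each band."""
    power = np.abs(np.fft.rfft(frame * np.hamming(len(frame)))) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
    edges = band_edges_hz(n_bands, fs)
    # Index of the band that each FFT bin falls into.
    band = np.searchsorted(edges, freqs, side="right") - 1
    band = np.clip(band, 0, n_bands - 1)
    return np.bincount(band, weights=power, minlength=n_bands)

fs = 8000
t = np.arange(256) / fs
frame = np.sin(2 * np.pi * 1350 * t)  # test tone

energies = band_energies(frame, fs)
edges = band_edges_hz(15, fs)
k = int(np.argmax(energies))
print(f"strongest band: {edges[k]:.0f}-{edges[k + 1]:.0f} Hz")
```

Many FFT bins are merged into each high-frequency band and only a few into each low-frequency band, which mimics the decreasing frequency resolution of the ear.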

Equal Loudness Curves

Perceptual Linear Prediction (PLP) [Hermansky 1990]
• Auditory-like modifications of short-term speech spectrum prior to its approximation by all-pole autoregressive model
  • critical-band spectral resolution
  • equal-loudness sensitivity
  • intensity-loudness nonlinearity
• Today applied in virtually all state-of-the-art experimental ASR systems

Spectral Basis from LDA
• LDA gives basis for projection of spectral space
[Figure: time-frequency plot of /j/ /u/ /ar/ /j/ /o/ /j/ /o/]

LDA vectors from Fourier Spectrum
[Figure: LDA vectors, labeled 16 %, 63 %, 2 %, 12 %]
• Spectral resolution of LDA-derived spectral basis is higher at low frequencies
• Critical bands of human hearing are narrower at lower frequencies

Sensitivity to Spectral Change (Malayath 1999)
[Figure panels: cosine basis, LDA-derived bases, critical-band filterbank]

Combination of channel and signal spectrum should be as flat (as random-like) as possible. Shannon, Communication in the Presence of Noise (1949)
[Figures: energy of the signal and level of noise in the channel, plotted over the resource space]
if signal could be controlled (e.g. in communication)
• put more signal where there is less noise
• sensory signal optimized for a given communication channel
if the receiver could be controlled
• put more resources (introduce less noise) where there is more signal
• biological system optimized for information extraction from sensory signals

What Is Wrong With the Short-term Spectral Envelope? 2) Fragile (easily corrupted by minor disturbances)
[Figure: spectrum over frequency f, disturbed by additive band-limited noise and by linear (high-pass) filtering]
• ignore the noisy parts of the spectrum
• remove means from parts of the spectrum
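Why removing means helps against linear filtering can be shown in a few lines. A fixed filter multiplies each spectral band by a constant gain, which becomes an additive constant in the log spectrum and disappears when the time average of each band is subtracted. The random "spectra" and the gain shape below are stand-ins chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a sequence of short-term power spectra: 200 frames x 20 bands.
clean = rng.uniform(0.1, 10.0, size=(200, 20))

# A fixed linear filter multiplies every frame by the same gain per band.
gain = np.linspace(0.01, 1.0, 20)  # crude high-pass shape (assumed)
filtered = clean * gain

def remove_mean(log_spec):
    """Subtract the time average of each band from its log-energy trajectory."""
    return log_spec - log_spec.mean(axis=0, keepdims=True)

a = remove_mean(np.log(clean))
b = remove_mean(np.log(filtered))
print(np.allclose(a, b))  # True
```

The same trick does not remove additive noise, which is why the slide treats the two disturbances separately.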

Simultaneous Masking
[Figure: threshold of perception of a tone at f plotted against noise bandwidth, for band-pass filtered noise centered at f; the critical bandwidth is marked]
• Nonlinear frequency resolution of hearing
• Critical bands
  • up to ~600 Hz constant bandwidth
  • above 1 kHz constant Q

More Important Outcome of Masking Experiments
• What happens outside the critical band does not affect detection of events within the band!
• Independent processing of parts of the spectrum?
Replace spectral vector by a matrix of posterior probabilities of acoustic events (Hermansky, Sharma and Pavel 1996; Bourlard and Dupont 1996)
[Figure: spectrum S(frequency) replaced by {p(f)}: pf1 pf2 pf3 pf4 pf5 pf6 along frequency]

What Is Wrong With the Short-term Spectral Envelope? 3) Coarticulation (inertia of organs of speech production)
[Figure: h e l o w o r l d u, with labels “coarticulation” and “human auditory perception”]

Masking in Time
[Figure: masker followed by signal after time t; increase in threshold plotted against t from 0 to 200 ms, with a curve for a stronger masker]
• suggests ~200 ms buffer in auditory system
• also seen in perception of loudness, detection of short stimuli, gaps in tones, auditory afterimages, binaural release from masking, ...
• what happens outside this buffer does not affect detection of signal within the buffer

Short-term Features?
[Figure: processing of data x taken from ~10 ms of the signal over time; longer time span? (~250 ms?)]

Cortical Receptive Fields
• time-frequency distribution of the linear component of the most efficient stimulus that excites the given auditory neuron
[Figure: average of the first two principal components (83% of variance) along temporal axis from about 180 cortical receptive fields (from D. Klein 2004, unpublished)]

Data for Deriving Posterior Probabilities of Speech Events
[Figure: time-frequency plane (FREQUENCY vs. TIME [s]) with a patch spanning 250-1000 ms and 1-3 critical bands]

How to Get Estimates of Temporal Evolution of Spectral Energy?
- with M. Athineos, D. Ellis (Columbia Univ), and P. Fousek (CTU Prague)
[Figure: data x taken from 10-20 ms of the signal vs. from 200-1000 ms; all-pole model of part of the time-frequency plane spanning 200-1000 ms and 1-3 Bark]
