Fundamentals of Speech Signal Processing
Waveform plots of typical vowel sounds - Voiced (濁音): waveform panels labeled tone 1, tone 2 and tone 4, plotted against time t
Speech Production and Source Model • Human vocal mechanism • Speech Source Model: excitation u(t) driving a vocal tract model to produce the speech signal x(t)
Voiced and Unvoiced Speech: waveforms x(t) and excitations u(t) for voiced speech (with the pitch period marked) and for unvoiced speech
Waveform plots of typical consonant sounds - Unvoiced (清音) and Voiced (濁音)
Frequency domain spectra of speech signals - Voiced and Unvoiced
Frequency Domain - Voiced: spectrum with formant frequencies marked
Frequency Domain - Unvoiced: spectrum with formant frequencies marked
Formant frequency contours for the utterance "He will allow a rare lie." Reference: Sec. 6.1 of Huang, or Secs. 2.2, 2.3 of Rabiner and Juang
Speech Signal Processing
• Speech Signals
• Carrying Linguistic Knowledge and Human Information: Characters, Words, Phrases, Sentences, Concepts, etc.
• Double Levels of Information: Acoustic Signal Level / Symbolic or Linguistic Level
• Processing and Interaction of the Double-level Information
• Major Application Areas
• Speech Coding: Digitization and Compression; Considerations: 1) bit rate (bps), 2) recovered quality, 3) computation complexity/feasibility
• Voice-based Network Access — User Interface, Content Analysis, User-Content Interaction
• Block diagrams: x(t) → LPF → x[n] → Processing Algorithms → output x̂[n]; x[n] → Processing → xk → 110101… → Storage/Transmission → Inverse Processing
Sampling of Signals: the continuous-time signal x(t) is sampled to the discrete-time sequence x[n]
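To make the sampling step concrete, here is a minimal sketch, assuming an 8 kHz sampling rate and a 200 Hz test tone standing in for a speech waveform (both values are illustrative, not from the notes):

```python
# A minimal sampling sketch: x[n] = x(n / fs) for an assumed 8 kHz rate.
import numpy as np

fs = 8000                              # sampling rate in Hz (assumed)
duration = 0.02                        # 20 ms of signal
t = np.arange(0, duration, 1.0 / fs)   # sample instants t = n / fs

# x(t): a 200 Hz tone standing in for a speech waveform
x_n = np.sin(2 * np.pi * 200 * t)      # x[n] = x(n / fs)

print(len(x_n), "samples for", duration * 1000, "ms of signal")
```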
Double Levels of Information: 句 (Sentence), 詞 (Word), 字 (Character); e.g. the sentence 人人用電腦 ("everyone uses computers") is made up of words such as 電腦 ("computer"), which are in turn made up of characters
Speech Signal Processing – Processing of Double-Level Information
• Acoustic level: Speech Signal → Sampling → Processing by algorithms on chips or computers
• Linguistic level: Linguistic Structure and Linguistic Knowledge (Lexicon, Grammar)
• Example: 今天的天氣非常好 ("the weather today is very good"), segmented into characters 今 天 的 天 氣 非 常 好 and into words 今天 的 天氣 非常 好 (or 今天的 天氣 非常 好)
Voice-based Network Access
• User Interface — when keyboards/mice are inadequate
• Content Analysis — helps in browsing/retrieval of multimedia content
• User-Content Interaction — all text-based interaction can be accomplished by spoken language
User Interface — Wireless Communications Technologies are Creating a Whole Variety of User Terminals
• Access to Text and Multimedia Content over the Internet at Any Time, from Anywhere
• Smart Phones, Hand-held Devices, Notebooks, Vehicular Electronics, Hands-free Interfaces, Home Appliances, Wearable Devices…
• Small in Size, Light in Weight, Ubiquitous, Invisible… the Post-PC Era
• Keyboard/Mouse, Most Convenient for PCs, is no longer Convenient — human fingers never shrink, and the application environment has changed
• Service Requirements Growing Exponentially
• Voice is the Only Interface Convenient for ALL User Terminals, at Any Time, from Anywhere, and to the Point in One Utterance
• Speech Processing is the Only Less Mature Part of the Technology Chain
Content Analysis — Multimedia Technologies are Creating a New World of Multimedia Content
• Future Integrated Networks will carry Real-time Information (weather, traffic, flight schedules, stock prices, sports scores), Private Services (personal notebooks, business databases, home appliances, network entertainment), Intelligent Working Environments (e-mail processors, intelligent agents, teleconferencing, distance learning, electronic commerce), Knowledge Archives (digital libraries, virtual museums), and Special Services (Google, Facebook, YouTube, Amazon)
• The Most Attractive Form of Network Content will be Multimedia, which usually Includes Speech Information (but Probably not Text)
• Multimedia Content is Difficult to Summarize and Show on the Screen, thus Difficult to Browse
• The Speech Information, if Included, usually Tells the Subjects, Topics and Concepts of the Multimedia Content, thus Becomes the Key for Browsing and Retrieval
• Multimedia Content Analysis based on Speech Information
User-Content Interaction — Wireless and Multimedia Technologies are Creating an Era of Network Access by Spoken Language Processing
• Component Technologies: Text-to-Speech Synthesis, Spoken and Multi-modal Dialogue, Voice-based Information Retrieval, Multimedia Content Analysis, Text Information Retrieval, linking voice input/output with text and multimedia content over the Internet
• Network Access is Primarily Text-based Today, but Almost All Roles of Text can be Accomplished by Speech
• User-Content Interaction can be Accomplished by Spoken and Multi-modal Dialogues
• Hand-held Devices with Multimedia Functionalities are Commonly Used Today
• Speech Instructions can be Used to Access Multimedia Content whose Key Concepts are Specified by the Speech Information it Includes
Waveform-based Approaches
• Pulse-Code Modulation (PCM): a binary representation of each sample x[n] obtained by quantization
• Differential PCM (DPCM): encoding the differences d[n] = x[n] − x[n−1], or more generally d[n] = x[n] − Σ_{k=1..P} a_k x[n−k] (a minimal encode/decode sketch follows below)
• Adaptive DPCM (ADPCM): DPCM with adaptive algorithms
• Ref: Haykin, "Communication Systems", 4th Ed., Secs. 3.7, 3.13, 3.14, 3.15
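A minimal first-order DPCM sketch (the uniform quantizer step size, 8 kHz rate and test tone are illustrative assumptions, not values from the notes):

```python
# First-order DPCM: encode d[n] = x[n] - x_hat[n-1] with a uniform quantizer,
# then reconstruct by accumulating the quantized differences.
import numpy as np

def dpcm_encode(x, step=0.05):
    """Quantize first-order prediction differences."""
    codes = np.zeros(len(x), dtype=int)
    prediction = 0.0
    for n in range(len(x)):
        d = x[n] - prediction
        codes[n] = int(np.round(d / step))   # quantized difference
        prediction += codes[n] * step        # track the decoder's reconstruction
    return codes

def dpcm_decode(codes, step=0.05):
    """Accumulate quantized differences to rebuild x_hat[n]."""
    return np.cumsum(codes * step)

x = np.sin(2 * np.pi * 200 * np.arange(160) / 8000)   # 20 ms test tone at 8 kHz
x_hat = dpcm_decode(dpcm_encode(x))
print("max reconstruction error:", np.max(np.abs(x - x_hat)))
```

Because the encoder predicts from its own reconstructed samples, the quantization error stays bounded by half a step and does not accumulate at the decoder.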
Speech Source Model and Source Coding
• Speech Source Model: an Excitation Generator, controlled by excitation parameters, produces u[n]; this drives a Vocal Tract Model G(ω), G(z), g[n], controlled by vocal tract parameters, to produce the speech signal: x[n] = u[n] * g[n], X(ω) = U(ω)G(ω), X(z) = U(z)G(z)
• digitization and transmission of the parameters is adequate
• at the receiver the parameters can reproduce x[n] with the model
• far fewer parameters with much slower variation in time lead to far fewer bits required
• this is the key to low bit rate speech coding
Speech Source Model: the waveform x(t) varies rapidly with time t, while the model parameters a[n] vary slowly with index n
Speech Source Model and Source Coding • Analysis and Synthesis • High computation requirements are the price for low bit rate
Simplified Speech Source Model
• Excitation: a periodic pulse train generator (voiced, with pitch period N) or a random sequence generator (unvoiced), selected by the v/u decision and scaled by gain G, produces u[n], which drives the Vocal Tract Model G(z), G(ω), g[n] to give x[n], where G(z) = 1 / (1 − Σ_{k=1..P} a_k z^−k)
• Vocal Tract parameters
• {a_k}: LPC coefficients, capturing the formant structure of speech signals
• a good approximation, though not precise enough
• Excitation parameters
• v/u: voiced/unvoiced decision
• N: pitch period for voiced speech
• G: signal gain
• together these define the excitation signal u[n]
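A minimal synthesis sketch of this simplified model, with assumed (not estimated) filter coefficients and frame settings:

```python
# Simplified source model: excitation u[n] is a periodic pulse train (voiced)
# or white noise (unvoiced), filtered by the all-pole vocal tract filter
# 1 / (1 - sum_k a_k z^-k).
import numpy as np
from scipy.signal import lfilter

def synthesize(a, gain, voiced, pitch_period, n_samples):
    """Generate one frame of speech from LPC coefficients {a_k}."""
    if voiced:
        u = np.zeros(n_samples)
        u[::pitch_period] = 1.0            # periodic pulse train, period N
    else:
        u = np.random.randn(n_samples)     # random sequence
    # All-pole filter: denominator [1, -a_1, ..., -a_P], numerator = gain G
    return lfilter([gain], np.concatenate(([1.0], -np.asarray(a))), u)

# Example: 2nd-order filter with a resonance (coefficients assumed, not estimated)
a = [1.3, -0.9]
frame = synthesize(a, gain=1.0, voiced=True, pitch_period=80, n_samples=400)
print(frame[:5])
```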
LPC Vocoder (Voice Coder)
• N is obtained by pitch detection and v/u by voicing detection (a simple pitch-detection sketch follows below)
• {a_k} can be non-uniformly quantized or vector quantized to reduce the bit rate further
• Ref: Sec. 3.3 (3.3.1 up to 3.3.9) of Rabiner and Juang, "Fundamentals of Speech Recognition", Prentice Hall, 1993
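The notes do not specify a pitch-detection method; one common choice is autocorrelation-based detection, sketched here with an assumed 8 kHz sampling rate, a 60-400 Hz search range and a 0.3 voicing threshold:

```python
# Autocorrelation pitch detector: pick the lag with the largest autocorrelation
# peak inside the plausible pitch-period range, and declare voiced if that peak
# is a sizable fraction of the zero-lag energy.
import numpy as np

def detect_pitch(frame, fs=8000, fmin=60.0, fmax=400.0):
    """Return (voiced, pitch_period_in_samples) for one frame."""
    frame = frame - np.mean(frame)
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]   # r[lag], lag >= 0
    lo, hi = int(fs / fmax), int(fs / fmin)      # plausible pitch-period lags
    lag = lo + int(np.argmax(r[lo:hi]))
    voiced = bool(r[lag] > 0.3 * r[0])           # 0.3 threshold is an assumption
    return voiced, (lag if voiced else 0)

# Example: a 200 Hz tone should give a period near fs / 200 = 40 samples
tone = np.sin(2 * np.pi * 200 * np.arange(400) / 8000)
print(detect_pitch(tone))
```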
Multipulse LPC
• poor modeling of u[n] is the main source of quality degradation in the LPC vocoder
• u[n] is replaced by a sequence of pulses: u[n] = Σ_k b_k δ[n − n_k]
• roughly 8 pulses per pitch period
• u[n] is close to periodic for voiced speech and close to random for unvoiced speech
• estimating (b_k, n_k) is a difficult problem
Multipulse LPC
• Estimating (b_k, n_k) is a difficult problem, solved by analysis by synthesis
• a large amount of computation is the price paid for better speech quality
Multipulse LPC
• Perceptual Weighting: W(z) = (1 − Σ_{k=1..P} a_k z^−k) / (1 − Σ_{k=1..P} a_k c^k z^−k)
• 0 < c < 1 for perceptual sensitivity
• W(z) = 1 if c = 1; W(z) = 1 − Σ_{k=1..P} a_k z^−k if c = 0
• practically c ≈ 0.8
• Error Evaluation: E = ∫ |X(ω) − X̂(ω)|² W(ω) dω
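A minimal sketch of applying this weighting filter, assuming second-order LPC coefficients chosen only for illustration:

```python
# Perceptual weighting filter W(z): numerator uses {a_k}, denominator uses
# {a_k * c^k}, with c ~ 0.8 as suggested above.
import numpy as np
from scipy.signal import lfilter

def perceptual_weight(error_signal, a, c=0.8):
    """Filter an error signal by W(z) = (1 - sum a_k z^-k) / (1 - sum a_k c^k z^-k)."""
    a = np.asarray(a)
    k = np.arange(1, len(a) + 1)
    num = np.concatenate(([1.0], -a))            # 1 - sum a_k z^-k
    den = np.concatenate(([1.0], -a * c ** k))   # 1 - sum a_k c^k z^-k
    return lfilter(num, den, error_signal)

# Example with assumed 2nd-order coefficients
e = np.random.randn(160)
print(perceptual_weight(e, a=[1.3, -0.9])[:5])
```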
Multipulse LPC
• Error Minimization and Pulse Search: u[n] = Σ_k b_k δ[n − n_k], x̂[n] = Σ_k b_k g[n − n_k], E = E(b_1, n_1, b_2, n_2, …)
• sub-optimal solution: finding one pulse at a time
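A minimal sketch of the greedy one-pulse-at-a-time search, using an unweighted squared error and an assumed exponential impulse response g[n] (a real coder would minimize the perceptually weighted error above):

```python
# Greedy multipulse search: at each step, pick the pulse position/amplitude that
# best cancels the remaining residual, then subtract its contribution.
import numpy as np

def multipulse_search(target, g, n_pulses=8):
    """Pick pulses so that sum_k b_k g[n - n_k] approximates the target frame."""
    residual = target.copy()
    pulses = []
    for _ in range(n_pulses):
        # correlate the residual with the synthesis impulse response g[n]
        corr = np.correlate(residual, g, mode="full")[len(g) - 1:]
        energy = np.sum(g ** 2)
        n_k = int(np.argmax(np.abs(corr)))       # best position for this pulse
        b_k = corr[n_k] / energy                 # least-squares amplitude
        contrib = np.zeros_like(residual)
        end = min(len(residual), n_k + len(g))
        contrib[n_k:end] = b_k * g[: end - n_k]
        residual -= contrib                      # remove this pulse's contribution
        pulses.append((n_k, b_k))
    return pulses

g = 0.9 ** np.arange(20)                         # assumed impulse response g[n]
target = np.zeros(160)
target[10:30] = g                                # toy target built from one pulse
print(multipulse_search(target, g, n_pulses=2))
```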
Code-Excited Linear Prediction (CELP)
• Use of VQ to Construct a Codebook of Excitation Sequences
• a sequence consists of roughly 40 samples
• a codebook of 512 ~ 1024 patterns is constructed with VQ
• roughly 512 ~ 1024 excitation patterns are perceptually adequate
• Excitation Search: analysis by synthesis (a codebook-search sketch follows below)
• 9 ~ 10 bits are needed for the excitation of 40 samples, while the {a_k} parameters in G(z) are also vector quantized
Code-Excited Linear Prediction (CELP)
• Receiver: the {a_k} codewords can be transmitted less frequently than the excitation codewords
• Ref: Gold and Morgan, "Speech and Audio Signal Processing", John Wiley & Sons, 2000, Chap. 33
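A minimal analysis-by-synthesis codebook search sketch; the random codebook, LPC coefficients and unweighted error are assumptions for illustration (a real CELP coder adds gain terms, an adaptive codebook and perceptual weighting):

```python
# CELP-style excitation search: pass every codebook entry through the synthesis
# filter G(z) and keep the one whose output is closest to the target frame.
import numpy as np
from scipy.signal import lfilter

def celp_search(target, codebook, a):
    """Return the index of the excitation codeword that best matches the target."""
    den = np.concatenate(([1.0], -np.asarray(a)))      # 1 - sum a_k z^-k
    best_index, best_error = -1, np.inf
    for i, excitation in enumerate(codebook):
        synthesized = lfilter([1.0], den, excitation)  # pass codeword through G(z)
        error = np.sum((target - synthesized) ** 2)    # squared error (unweighted here)
        if error < best_error:
            best_index, best_error = i, error
    return best_index

rng = np.random.default_rng(0)
codebook = rng.standard_normal((512, 40))              # 512 random 40-sample codewords
a = [1.3, -0.9]                                        # assumed LPC coefficients
target = lfilter([1.0], [1.0, -1.3, 0.9], codebook[7]) # frame generated from entry 7
print(celp_search(target, codebook, a))                # should print 7
```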
4.0 Speech Recognition and Voice-based Network Access
Speech Recognition as a Pattern Recognition Problem: the unknown speech signal x(t) passes through Feature Extraction to give the feature vector sequence X; Pattern Matching against Reference Patterns Y (obtained by Feature Extraction from training speech y(t)), followed by Decision Making, yields the output word W
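A minimal pattern-matching sketch under a strong simplifying assumption (equal-length feature sequences compared by Euclidean distance; practical systems use DTW or HMMs instead):

```python
# Classify an unknown feature-vector sequence X by its distance to stored
# reference patterns, one per word.
import numpy as np

def recognize(X, reference_patterns):
    """Return the label of the reference pattern closest to X."""
    best_label, best_dist = None, np.inf
    for label, Y in reference_patterns.items():
        dist = np.sum((X - Y) ** 2)          # total frame-by-frame distance
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

# Toy example: 10 frames of 13-dimensional features per word (values assumed)
rng = np.random.default_rng(1)
refs = {"yes": rng.standard_normal((10, 13)), "no": rng.standard_normal((10, 13))}
unknown = refs["yes"] + 0.1 * rng.standard_normal((10, 13))   # a noisy "yes"
print(recognize(unknown, refs))
```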
Basic Approach for Large Vocabulary Speech Recognition
• A Simplified Block Diagram: Input Speech → Front-end Signal Processing → Feature Vectors → Linguistic Decoding and Search Algorithm → Output Sentence, using Acoustic Models (trained from speech corpora), a Lexicon, and a Language Model (constructed from text corpora)
• Example Input Sentence: this is speech
• Acoustic Models (聲學模型): (th-ih-s-ih-z-s-p-iy-ch)
• Lexicon: (th-ih-s) → this, (ih-z) → is, (s-p-iy-ch) → speech
• Language Model (語言模型): (this) – (is) – (speech), scored as P(this) P(is | this) P(speech | this is)
• P(wi | wi-1): bi-gram language model; P(wi | wi-1, wi-2): tri-gram language model, etc.
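A minimal bi-gram scoring sketch for the example sentence; the probabilities are made-up illustrative values, not trained estimates:

```python
# Bi-gram language model score: P(W) = P(w1) * prod_i P(w_i | w_{i-1}),
# computed in the log domain to avoid underflow.
import math

unigram = {"this": 0.01}
bigram = {("this", "is"): 0.20, ("is", "speech"): 0.005}

def sentence_log_prob(words):
    """Log P(W) under the bi-gram model above."""
    logp = math.log(unigram[words[0]])
    for prev, cur in zip(words, words[1:]):
        logp += math.log(bigram[(prev, cur)])
    return logp

print(sentence_log_prob(["this", "is", "speech"]))
```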
HMM illustration: State Transition Probabilities between states, with 1-dim Gaussian Mixtures as the observation distributions
Simplified HMM: an example observation sequence of colored balls, R G B G G B B G R R R …
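A minimal forward-algorithm sketch for a discrete HMM with R/G/B observations matching the simplified example; all probability values are illustrative assumptions:

```python
# Forward algorithm for a 3-state discrete HMM observing colored balls (R, G, B).
import numpy as np

A = np.array([[0.6, 0.3, 0.1],        # state transition probabilities a_ij
              [0.1, 0.7, 0.2],
              [0.3, 0.2, 0.5]])
B = np.array([[0.7, 0.2, 0.1],        # observation probabilities b_j(R), b_j(G), b_j(B)
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])
pi = np.array([0.5, 0.3, 0.2])        # initial state probabilities
symbols = {"R": 0, "G": 1, "B": 2}

def forward_prob(observations):
    """P(O | model) via the forward algorithm."""
    alpha = pi * B[:, symbols[observations[0]]]
    for o in observations[1:]:
        alpha = (alpha @ A) * B[:, symbols[o]]   # induction step
    return alpha.sum()

print(forward_prob("RGBGGBBGRRR"))
```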
N-gram language models: for a word sequence W1 W2 W3 W4 W5 W6 … WR, a tri-gram model conditions each word on its two predecessors, sliding a three-word window W1 W2 W3, W2 W3 W4, … along the sequence
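A minimal sketch of maximum-likelihood tri-gram estimation from counts, using a made-up toy corpus (real models need smoothing for unseen histories):

```python
# ML tri-gram estimate: P(w_i | w_{i-2}, w_{i-1})
#   = Count(w_{i-2} w_{i-1} w_i) / Count(w_{i-2} w_{i-1}).
from collections import Counter

corpus = ["this is speech", "this is text", "this is speech too"]
trigram_counts, bigram_counts = Counter(), Counter()

for sentence in corpus:
    words = sentence.split()
    for i in range(2, len(words)):
        trigram_counts[tuple(words[i - 2:i + 1])] += 1
        bigram_counts[tuple(words[i - 2:i])] += 1

def trigram_prob(w2, w1, w):
    """ML estimate of P(w | w2, w1); zero if the history was never seen."""
    history = bigram_counts[(w2, w1)]
    return trigram_counts[(w2, w1, w)] / history if history else 0.0

print(trigram_prob("this", "is", "speech"))   # 2 of 3 "this is" histories -> ~0.667
```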