Fundamentals of Speech Signal Processing
Waveform plots of typical vowel sounds - Voiced (濁音): waveform panels labeled tone 1, tone 2 and tone 4, plotted against time t
Speech Production and Source Model • Human vocal mechanism • Speech Source Model: excitation u(t) driving a vocal tract model to produce the speech signal x(t)
Voiced and Unvoiced Speech: waveforms x(t) and excitations u(t) for voiced speech (with the pitch period marked) and for unvoiced speech
Waveform plots of typical consonant sounds - Unvoiced (清音) and Voiced (濁音)
Frequency domain spectra of speech signals - Voiced and Unvoiced
Frequency Domain - Voiced: spectrum with formant frequencies marked
Frequency Domain - Unvoiced: spectrum with formant frequencies marked
Formant frequency contours for the utterance "He will allow a rare lie." Reference: Sec. 6.1 of Huang, or Secs. 2.2, 2.3 of Rabiner and Juang
Speech Signal Processing
• Speech Signals
• Carrying Linguistic Knowledge and Human Information: Characters, Words, Phrases, Sentences, Concepts, etc.
• Double Levels of Information: Acoustic Signal Level / Symbolic or Linguistic Level
• Processing and Interaction of the Double-level Information
• Major Application Areas
• Speech Coding: Digitization and Compression; Considerations: 1) bit rate (bps), 2) recovered quality, 3) computation complexity/feasibility
• Voice-based Network Access — User Interface, Content Analysis, User-Content Interaction
• Block diagrams: x(t) → LPF → x[n] → Processing Algorithms → output x̂[n]; x[n] → Processing → xk → 110101… → Storage/Transmission → Inverse Processing
Sampling of Signals: the continuous-time signal x(t) is sampled to the discrete-time sequence x[n]
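To make the sampling step concrete, here is a minimal sketch, assuming an 8 kHz sampling rate and a 200 Hz test tone standing in for a speech waveform (both values are illustrative, not from the notes):

```python
# A minimal sampling sketch: x[n] = x(n / fs) for an assumed 8 kHz rate.
import numpy as np

fs = 8000                              # sampling rate in Hz (assumed)
duration = 0.02                        # 20 ms of signal
t = np.arange(0, duration, 1.0 / fs)   # sample instants t = n / fs

# x(t): a 200 Hz tone standing in for a speech waveform
x_n = np.sin(2 * np.pi * 200 * t)      # x[n] = x(n / fs)

print(len(x_n), "samples for", duration * 1000, "ms of signal")
```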
Double Levels of Information: 句 (Sentence), 詞 (Word), 字 (Character); e.g. the sentence 人人用電腦 ("everyone uses computers") is made up of words such as 電腦 ("computer"), which are in turn made up of characters
Speech Signal Processing – Processing of Double-Level Information
• Acoustic level: Speech Signal → Sampling → Processing by algorithms on chips or computers
• Linguistic level: Linguistic Structure and Linguistic Knowledge (Lexicon, Grammar)
• Example: 今天的天氣非常好 ("the weather today is very good"), segmented into characters 今 天 的 天 氣 非 常 好 and into words 今天 的 天氣 非常 好 (or 今天的 天氣 非常 好)
Voice-based Network Access
• User Interface — when keyboards/mice are inadequate
• Content Analysis — helps in browsing/retrieval of multimedia content
• User-Content Interaction — all text-based interaction can be accomplished by spoken language
User Interface — Wireless Communications Technologies are Creating a Whole Variety of User Terminals
• Access to Text and Multimedia Content over the Internet at Any Time, from Anywhere
• Smart Phones, Hand-held Devices, Notebooks, Vehicular Electronics, Hands-free Interfaces, Home Appliances, Wearable Devices…
• Small in Size, Light in Weight, Ubiquitous, Invisible… the Post-PC Era
• Keyboard/Mouse, Most Convenient for PCs, is no longer Convenient — human fingers never shrink, and the application environment has changed
• Service Requirements Growing Exponentially
• Voice is the Only Interface Convenient for ALL User Terminals, at Any Time, from Anywhere, and to the Point in One Utterance
• Speech Processing is the Only Less Mature Part of the Technology Chain
Content Analysis — Multimedia Technologies are Creating a New World of Multimedia Content
• Future Integrated Networks will carry Real-time Information (weather, traffic, flight schedules, stock prices, sports scores), Private Services (personal notebooks, business databases, home appliances, network entertainment), Intelligent Working Environments (e-mail processors, intelligent agents, teleconferencing, distance learning, electronic commerce), Knowledge Archives (digital libraries, virtual museums), and Special Services (Google, Facebook, YouTube, Amazon)
• The Most Attractive Form of Network Content will be Multimedia, which usually Includes Speech Information (but Probably not Text)
• Multimedia Content is Difficult to Summarize and Show on the Screen, thus Difficult to Browse
• The Speech Information, if Included, usually Tells the Subjects, Topics and Concepts of the Multimedia Content, thus Becomes the Key for Browsing and Retrieval
• Multimedia Content Analysis based on Speech Information
User-Content Interaction — Wireless and Multimedia Technologies are Creating an Era of Network Access by Spoken Language Processing
• Component Technologies: Text-to-Speech Synthesis, Spoken and Multi-modal Dialogue, Voice-based Information Retrieval, Multimedia Content Analysis, Text Information Retrieval, linking voice input/output with text and multimedia content over the Internet
• Network Access is Primarily Text-based Today, but Almost All Roles of Text can be Accomplished by Speech
• User-Content Interaction can be Accomplished by Spoken and Multi-modal Dialogues
• Hand-held Devices with Multimedia Functionalities are Commonly Used Today
• Speech Instructions can be Used to Access Multimedia Content whose Key Concepts are Specified by the Speech Information it Includes
Waveform-based Approaches
• Pulse-Code Modulation (PCM): a binary representation of each sample x[n] obtained by quantization
• Differential PCM (DPCM): encoding the differences d[n] = x[n] − x[n−1], or more generally d[n] = x[n] − Σ_{k=1..P} a_k x[n−k] (a minimal encode/decode sketch follows below)
• Adaptive DPCM (ADPCM): DPCM with adaptive algorithms
• Ref: Haykin, "Communication Systems", 4th Ed., Secs. 3.7, 3.13, 3.14, 3.15
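A minimal first-order DPCM sketch (the uniform quantizer step size, 8 kHz rate and test tone are illustrative assumptions, not values from the notes):

```python
# First-order DPCM: encode d[n] = x[n] - x_hat[n-1] with a uniform quantizer,
# then reconstruct by accumulating the quantized differences.
import numpy as np

def dpcm_encode(x, step=0.05):
    """Quantize first-order prediction differences."""
    codes = np.zeros(len(x), dtype=int)
    prediction = 0.0
    for n in range(len(x)):
        d = x[n] - prediction
        codes[n] = int(np.round(d / step))   # quantized difference
        prediction += codes[n] * step        # track the decoder's reconstruction
    return codes

def dpcm_decode(codes, step=0.05):
    """Accumulate quantized differences to rebuild x_hat[n]."""
    return np.cumsum(codes * step)

x = np.sin(2 * np.pi * 200 * np.arange(160) / 8000)   # 20 ms test tone at 8 kHz
x_hat = dpcm_decode(dpcm_encode(x))
print("max reconstruction error:", np.max(np.abs(x - x_hat)))
```

Because the encoder predicts from its own reconstructed samples, the quantization error stays bounded by half a step and does not accumulate at the decoder.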
Speech Source Model and Source Coding
• Speech Source Model: an Excitation Generator, controlled by excitation parameters, produces u[n]; this drives a Vocal Tract Model G(ω), G(z), g[n], controlled by vocal tract parameters, to produce the speech signal: x[n] = u[n] * g[n], X(ω) = U(ω)G(ω), X(z) = U(z)G(z)
• digitization and transmission of the parameters is adequate
• at the receiver the parameters can reproduce x[n] with the model
• far fewer parameters with much slower variation in time lead to far fewer bits required
• this is the key to low bit rate speech coding
Speech Source Model: the waveform x(t) varies rapidly with time t, while the model parameters a[n] vary slowly with index n
Speech Source Model and Source Coding • Analysis and Synthesis • High computation requirements are the price for low bit rate
Simplified Speech Source Model
• Excitation: a periodic pulse train generator (voiced, with pitch period N) or a random sequence generator (unvoiced), selected by the v/u decision and scaled by gain G, produces u[n], which drives the Vocal Tract Model G(z), G(ω), g[n] to give x[n], where G(z) = 1 / (1 − Σ_{k=1..P} a_k z^−k)
• Vocal Tract parameters
• {a_k}: LPC coefficients, capturing the formant structure of speech signals
• a good approximation, though not precise enough
• Excitation parameters
• v/u: voiced/unvoiced decision
• N: pitch period for voiced speech
• G: signal gain
• together these define the excitation signal u[n]
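A minimal synthesis sketch of this simplified model, with assumed (not estimated) filter coefficients and frame settings:

```python
# Simplified source model: excitation u[n] is a periodic pulse train (voiced)
# or white noise (unvoiced), filtered by the all-pole vocal tract filter
# 1 / (1 - sum_k a_k z^-k).
import numpy as np
from scipy.signal import lfilter

def synthesize(a, gain, voiced, pitch_period, n_samples):
    """Generate one frame of speech from LPC coefficients {a_k}."""
    if voiced:
        u = np.zeros(n_samples)
        u[::pitch_period] = 1.0            # periodic pulse train, period N
    else:
        u = np.random.randn(n_samples)     # random sequence
    # All-pole filter: denominator [1, -a_1, ..., -a_P], numerator = gain G
    return lfilter([gain], np.concatenate(([1.0], -np.asarray(a))), u)

# Example: 2nd-order filter with a resonance (coefficients assumed, not estimated)
a = [1.3, -0.9]
frame = synthesize(a, gain=1.0, voiced=True, pitch_period=80, n_samples=400)
print(frame[:5])
```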
LPC Vocoder (Voice Coder)
• N is obtained by pitch detection and v/u by voicing detection (a simple pitch-detection sketch follows below)
• {a_k} can be non-uniformly quantized or vector quantized to reduce the bit rate further
• Ref: Sec. 3.3 (3.3.1 up to 3.3.9) of Rabiner and Juang, "Fundamentals of Speech Recognition", Prentice Hall, 1993
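The notes do not specify a pitch-detection method; one common choice is autocorrelation-based detection, sketched here with an assumed 8 kHz sampling rate, a 60-400 Hz search range and a 0.3 voicing threshold:

```python
# Autocorrelation pitch detector: pick the lag with the largest autocorrelation
# peak inside the plausible pitch-period range, and declare voiced if that peak
# is a sizable fraction of the zero-lag energy.
import numpy as np

def detect_pitch(frame, fs=8000, fmin=60.0, fmax=400.0):
    """Return (voiced, pitch_period_in_samples) for one frame."""
    frame = frame - np.mean(frame)
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]   # r[lag], lag >= 0
    lo, hi = int(fs / fmax), int(fs / fmin)      # plausible pitch-period lags
    lag = lo + int(np.argmax(r[lo:hi]))
    voiced = bool(r[lag] > 0.3 * r[0])           # 0.3 threshold is an assumption
    return voiced, (lag if voiced else 0)

# Example: a 200 Hz tone should give a period near fs / 200 = 40 samples
tone = np.sin(2 * np.pi * 200 * np.arange(400) / 8000)
print(detect_pitch(tone))
```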
Multipulse LPC
• poor modeling of u[n] is the main source of quality degradation in the LPC vocoder
• u[n] is replaced by a sequence of pulses: u[n] = Σ_k b_k δ[n − n_k]
• roughly 8 pulses per pitch period
• u[n] is close to periodic for voiced speech and close to random for unvoiced speech
• estimating (b_k, n_k) is a difficult problem
Multipulse LPC
• Estimating (b_k, n_k) is a difficult problem, solved by analysis by synthesis
• a large amount of computation is the price paid for better speech quality
Multipulse LPC
• Perceptual Weighting: W(z) = (1 − Σ_{k=1..P} a_k z^−k) / (1 − Σ_{k=1..P} a_k c^k z^−k)
• 0 < c < 1 for perceptual sensitivity
• W(z) = 1 if c = 1; W(z) = 1 − Σ_{k=1..P} a_k z^−k if c = 0
• practically c ≈ 0.8
• Error Evaluation: E = ∫ |X(ω) − X̂(ω)|² W(ω) dω
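A minimal sketch of applying this weighting filter, assuming second-order LPC coefficients chosen only for illustration:

```python
# Perceptual weighting filter W(z): numerator uses {a_k}, denominator uses
# {a_k * c^k}, with c ~ 0.8 as suggested above.
import numpy as np
from scipy.signal import lfilter

def perceptual_weight(error_signal, a, c=0.8):
    """Filter an error signal by W(z) = (1 - sum a_k z^-k) / (1 - sum a_k c^k z^-k)."""
    a = np.asarray(a)
    k = np.arange(1, len(a) + 1)
    num = np.concatenate(([1.0], -a))            # 1 - sum a_k z^-k
    den = np.concatenate(([1.0], -a * c ** k))   # 1 - sum a_k c^k z^-k
    return lfilter(num, den, error_signal)

# Example with assumed 2nd-order coefficients
e = np.random.randn(160)
print(perceptual_weight(e, a=[1.3, -0.9])[:5])
```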
Multipulse LPC
• Error Minimization and Pulse Search: u[n] = Σ_k b_k δ[n − n_k], x̂[n] = Σ_k b_k g[n − n_k], E = E(b_1, n_1, b_2, n_2, …)
• sub-optimal solution: finding one pulse at a time
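A minimal sketch of the greedy one-pulse-at-a-time search, using an unweighted squared error and an assumed exponential impulse response g[n] (a real coder would minimize the perceptually weighted error above):

```python
# Greedy multipulse search: at each step, pick the pulse position/amplitude that
# best cancels the remaining residual, then subtract its contribution.
import numpy as np

def multipulse_search(target, g, n_pulses=8):
    """Pick pulses so that sum_k b_k g[n - n_k] approximates the target frame."""
    residual = target.copy()
    pulses = []
    for _ in range(n_pulses):
        # correlate the residual with the synthesis impulse response g[n]
        corr = np.correlate(residual, g, mode="full")[len(g) - 1:]
        energy = np.sum(g ** 2)
        n_k = int(np.argmax(np.abs(corr)))       # best position for this pulse
        b_k = corr[n_k] / energy                 # least-squares amplitude
        contrib = np.zeros_like(residual)
        end = min(len(residual), n_k + len(g))
        contrib[n_k:end] = b_k * g[: end - n_k]
        residual -= contrib                      # remove this pulse's contribution
        pulses.append((n_k, b_k))
    return pulses

g = 0.9 ** np.arange(20)                         # assumed impulse response g[n]
target = np.zeros(160)
target[10:30] = g                                # toy target built from one pulse
print(multipulse_search(target, g, n_pulses=2))
```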
Code-Excited Linear Prediction (CELP)
• Use of VQ to Construct a Codebook of Excitation Sequences
• a sequence consists of roughly 40 samples
• a codebook of 512 ~ 1024 patterns is constructed with VQ
• roughly 512 ~ 1024 excitation patterns are perceptually adequate
• Excitation Search: analysis by synthesis (a codebook-search sketch follows below)
• 9 ~ 10 bits are needed for the excitation of 40 samples, while the {a_k} parameters in G(z) are also vector quantized
Code-Excited Linear Prediction (CELP)
• Receiver: the {a_k} codewords can be transmitted less frequently than the excitation codewords
• Ref: Gold and Morgan, "Speech and Audio Signal Processing", John Wiley & Sons, 2000, Chap. 33
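A minimal analysis-by-synthesis codebook search sketch; the random codebook, LPC coefficients and unweighted error are assumptions for illustration (a real CELP coder adds gain terms, an adaptive codebook and perceptual weighting):

```python
# CELP-style excitation search: pass every codebook entry through the synthesis
# filter G(z) and keep the one whose output is closest to the target frame.
import numpy as np
from scipy.signal import lfilter

def celp_search(target, codebook, a):
    """Return the index of the excitation codeword that best matches the target."""
    den = np.concatenate(([1.0], -np.asarray(a)))      # 1 - sum a_k z^-k
    best_index, best_error = -1, np.inf
    for i, excitation in enumerate(codebook):
        synthesized = lfilter([1.0], den, excitation)  # pass codeword through G(z)
        error = np.sum((target - synthesized) ** 2)    # squared error (unweighted here)
        if error < best_error:
            best_index, best_error = i, error
    return best_index

rng = np.random.default_rng(0)
codebook = rng.standard_normal((512, 40))              # 512 random 40-sample codewords
a = [1.3, -0.9]                                        # assumed LPC coefficients
target = lfilter([1.0], [1.0, -1.3, 0.9], codebook[7]) # frame generated from entry 7
print(celp_search(target, codebook, a))                # should print 7
```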
4.0 Speech Recognition and Voice-based Network Access
Speech Recognition as a Pattern Recognition Problem: the unknown speech signal x(t) passes through Feature Extraction to give the feature vector sequence X; Pattern Matching against Reference Patterns Y (obtained by Feature Extraction from training speech y(t)), followed by Decision Making, yields the output word W
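A minimal pattern-matching sketch under a strong simplifying assumption (equal-length feature sequences compared by Euclidean distance; practical systems use DTW or HMMs instead):

```python
# Classify an unknown feature-vector sequence X by its distance to stored
# reference patterns, one per word.
import numpy as np

def recognize(X, reference_patterns):
    """Return the label of the reference pattern closest to X."""
    best_label, best_dist = None, np.inf
    for label, Y in reference_patterns.items():
        dist = np.sum((X - Y) ** 2)          # total frame-by-frame distance
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

# Toy example: 10 frames of 13-dimensional features per word (values assumed)
rng = np.random.default_rng(1)
refs = {"yes": rng.standard_normal((10, 13)), "no": rng.standard_normal((10, 13))}
unknown = refs["yes"] + 0.1 * rng.standard_normal((10, 13))   # a noisy "yes"
print(recognize(unknown, refs))
```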
Basic Approach for Large Vocabulary Speech Recognition
• A Simplified Block Diagram: Input Speech → Front-end Signal Processing → Feature Vectors → Linguistic Decoding and Search Algorithm → Output Sentence, using Acoustic Models (trained from speech corpora), a Lexicon, and a Language Model (constructed from text corpora)
• Example Input Sentence: this is speech
• Acoustic Models (聲學模型): (th-ih-s-ih-z-s-p-iy-ch)
• Lexicon: (th-ih-s) → this, (ih-z) → is, (s-p-iy-ch) → speech
• Language Model (語言模型): (this) – (is) – (speech), scored as P(this) P(is | this) P(speech | this is)
• P(wi | wi-1): bi-gram language model; P(wi | wi-1, wi-2): tri-gram language model, etc.
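A minimal bi-gram scoring sketch for the example sentence; the probabilities are made-up illustrative values, not trained estimates:

```python
# Bi-gram language model score: P(W) = P(w1) * prod_i P(w_i | w_{i-1}),
# computed in the log domain to avoid underflow.
import math

unigram = {"this": 0.01}
bigram = {("this", "is"): 0.20, ("is", "speech"): 0.005}

def sentence_log_prob(words):
    """Log P(W) under the bi-gram model above."""
    logp = math.log(unigram[words[0]])
    for prev, cur in zip(words, words[1:]):
        logp += math.log(bigram[(prev, cur)])
    return logp

print(sentence_log_prob(["this", "is", "speech"]))
```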
HMM illustration: State Transition Probabilities between states, with 1-dim Gaussian Mixtures as the observation distributions
Simplified HMM: an example observation sequence of colored balls, R G B G G B B G R R R …
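A minimal forward-algorithm sketch for a discrete HMM with R/G/B observations matching the simplified example; all probability values are illustrative assumptions:

```python
# Forward algorithm for a 3-state discrete HMM observing colored balls (R, G, B).
import numpy as np

A = np.array([[0.6, 0.3, 0.1],        # state transition probabilities a_ij
              [0.1, 0.7, 0.2],
              [0.3, 0.2, 0.5]])
B = np.array([[0.7, 0.2, 0.1],        # observation probabilities b_j(R), b_j(G), b_j(B)
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])
pi = np.array([0.5, 0.3, 0.2])        # initial state probabilities
symbols = {"R": 0, "G": 1, "B": 2}

def forward_prob(observations):
    """P(O | model) via the forward algorithm."""
    alpha = pi * B[:, symbols[observations[0]]]
    for o in observations[1:]:
        alpha = (alpha @ A) * B[:, symbols[o]]   # induction step
    return alpha.sum()

print(forward_prob("RGBGGBBGRRR"))
```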
N-gram language models: for a word sequence W1 W2 W3 W4 W5 W6 … WR, a tri-gram model conditions each word on its two predecessors, sliding a three-word window W1 W2 W3, W2 W3 W4, … along the sequence
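A minimal sketch of maximum-likelihood tri-gram estimation from counts, using a made-up toy corpus (real models need smoothing for unseen histories):

```python
# ML tri-gram estimate: P(w_i | w_{i-2}, w_{i-1})
#   = Count(w_{i-2} w_{i-1} w_i) / Count(w_{i-2} w_{i-1}).
from collections import Counter

corpus = ["this is speech", "this is text", "this is speech too"]
trigram_counts, bigram_counts = Counter(), Counter()

for sentence in corpus:
    words = sentence.split()
    for i in range(2, len(words)):
        trigram_counts[tuple(words[i - 2:i + 1])] += 1
        bigram_counts[tuple(words[i - 2:i])] += 1

def trigram_prob(w2, w1, w):
    """ML estimate of P(w | w2, w1); zero if the history was never seen."""
    history = bigram_counts[(w2, w1)]
    return trigram_counts[(w2, w1, w)] / history if history else 0.0

print(trigram_prob("this", "is", "speech"))   # 2 of 3 "this is" histories -> ~0.667
```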