
Introduction to Voice Compression VC Lab 2009


Presentation Transcript


  1. Introduction to Voice Compression, VC Lab 2009

  2. Outline • Why digitize voice? • Introduction • Speech properties • Performance measurement • The channel vocoders and LPC-10 • Sinusoidal coders • Analysis-by-synthesis linear predictive coders and G.723.1 • Quality of service in voice over IP (VoIP)

  3. Why Digitize Voice? • When voice is digitized: • Multiplexing is easier • Signaling is easier • PCs and other computers can be used • The voice switch becomes a big computer, and all lines are digital • Lines are not as noisy • Line quality can be monitored closely • New services can be provided • Lines are more tolerant of noise • Encryption is possible • Three steps to digitize: sampling, quantization, and coding • PCM, DPCM, DM, ADPCM, ADM
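To make the three steps concrete, here is a minimal, illustrative sketch of uniform 8-bit PCM (the 440 Hz sine merely stands in for a voice signal, and real telephone PCM uses logarithmic mu-law/A-law companding rather than this uniform quantizer):

```python
import numpy as np

fs = 8000                                  # sampling rate (Hz): twice the ~4 kHz voice bandwidth
t = np.arange(fs) / fs                     # one second of time axis
x = 0.5 * np.sin(2 * np.pi * 440 * t)      # stand-in for an analog voice signal

# Quantization: map each sample to one of 2^8 = 256 uniform levels
bits = 8
levels = 2 ** bits
q = np.round((x + 1) / 2 * (levels - 1))   # level indices 0..255

# Coding: emit each index as an 8-bit codeword
codewords = q.astype(np.uint8)
print(len(codewords) * bits, "bits per second")   # 8000 * 8 = 64000 -> 64 kbits/s PCM
```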

  4. Introduction • Sampling: 4 kHz bandwidth, 8000 samples per second • Bit rate • original (PCM): 64 kbits/s • medium-rate: 8-16 kbits/s • low-rate: 2.4-8 kbits/s • very-low-rate: < 2.4 kbits/s • The speech model: speech is produced by forcing air first through an elastic opening, the vocal cords, and then through the laryngeal, oral, nasal, and pharynx passages, and finally through the mouth and the nasal cavity. • Analysis-synthesis process: open-loop and closed-loop

  5. Speech Compression Techniques • Parametric representations: speech specific or non-speech specific • non-speech-specific (waveform) coders: faithfully reconstruct the time-domain waveform • speech-specific (voice) coders: rely on speech models and focus on producing perceptually intelligible speech without necessarily matching the waveform • Waveform coding methods • The channel vocoder methods • Sinusoidal analysis-synthesis methods • Analysis-by-synthesis linear predictive methods

  6. The Glottis • No matter what language is being spoken, the speech is generated using machinery that is not very different from person to person. • This machinery has to obey certain physical laws that substantially limit the behavior of its outputs.

  7. Speech Properties • Speech signals are non-stationary; at best they can be considered quasi-stationary over short segments, typically 5-20 ms. • Voiced speech is quasi-periodic in the time domain and harmonically structured in the frequency domain, while unvoiced speech is random-like and broadband. The energy of voiced segments is generally higher than that of unvoiced segments. • The short-time spectrum of voiced speech is characterized by its fine and formant structure. The fine harmonic structure is a consequence of the quasi-periodicity of speech and may be attributed to the vibrating vocal cords. The formant structure (spectral envelope) is due to the interaction of the source and the vocal tract. The vocal tract consists of the pharynx and the mouth cavity.
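Because of this quasi-stationarity, nearly every coder below begins by cutting the signal into short frames. An illustrative framing sketch (the 20 ms frame, 10 ms hop, and Hamming window are common choices, not mandated by any particular coder):

```python
import numpy as np

def frames(x, fs=8000, frame_ms=20, hop_ms=10):
    """Split a speech signal into short, overlapping quasi-stationary frames.

    The frame length sits in the 5-20 ms range quoted on the slide.
    """
    n = int(fs * frame_ms / 1000)      # 160 samples at 8 kHz
    hop = int(fs * hop_ms / 1000)      # 80 samples -> 50% overlap
    win = np.hamming(n)                # taper to reduce edge effects
    return np.array([x[i:i + n] * win
                     for i in range(0, len(x) - n + 1, hop)])
```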

  8. [Figure: unvoiced and voiced speech; formant structure and quasi-periodicity]

  9. [Figure: the sound /e/ in "test" and the sound /s/ in "test"]

  10. The shape of the spectral envelope that "fits" the short-time spectrum of voiced speech is associated with the transfer characteristics of the vocal tract and the spectral tilt (6 dB/octave) due to the glottal pulse. • The spectral envelope is characterized by a set of peaks, which are called formants. The formants are the resonant modes of the vocal tract. • For the average vocal tract, there are three to five formants below 5 kHz. The amplitudes and locations of the first three formants, usually occurring below 3 kHz, are quite important in both speech synthesis and perception. • Higher formants are also important for wideband and unvoiced speech representations. • The properties of speech are related to the physical speech production system. Voiced speech is produced by exciting the vocal tract with periodic glottal air pulses generated by the vibrating vocal cords. The frequency of the periodic pulses is referred to as the fundamental frequency or pitch. • Unvoiced speech is produced by forcing air through a constriction in the vocal tract. Nasal sounds (e.g., /n/) are due to the acoustical coupling of the nasal tract to the vocal tract, and plosive sounds (e.g., /p/) are produced by abruptly releasing air pressure which was built up behind a closure in the tract.

  11. Historical Perspective • The first analysis-synthesis method [Dudley 1939]: analyze speech in terms of its pitch and spectrum, and synthesize it by exciting a bank of ten analog band-pass filters (representing the vocal tract) with periodic (buzz, voiced) or random (hiss, unvoiced) excitation.

  12. Pulse Code Modulation (PCM), Differential PCM, Adaptive DPCM • Linear speech source-system production model [Fant, 1960] • Linear prediction analysis: a process where the present speech sample is predicted by a linear combination of previous samples. • Homomorphic analysis: a method that can be used for separating signals that have been combined by convolution.

  13. Short-Time Fourier Transform (STFT) [Flanagan and Golden]: analysis-synthesis of speech using the STFT. • Transform coding, sub-band coding • Sinusoidal analysis-synthesis of speech [McAulay and Quatieri] • Multiband excitation vocoders [Griffin and Lim] • Multi-pulse and vector excitation schemes for LPC [Atal et al.] • Vector quantization (VQ) [Gersho and Gray]: vector quantization proved to be very useful in encoding LPC parameters. • Code Excited Linear Prediction (CELP): Atal and Schroeder proposed a linear prediction algorithm with stochastic vector excitation. The stochastic excitation in CELP is determined using a perceptually weighted closed-loop (analysis-by-synthesis) optimization. • Groupe Spécial Mobile (GSM): a standard that uses a 13 kbits/s regular pulse excitation algorithm.

  14. Performance Measurement • Issues: the bit rate, the quality of the reconstructed speech, the complexity of the algorithm, and the delay introduced. • Four kinds of speech quality: broadcast (~64 kbits/s), network or toll (~16 kbits/s), communications (~4.8 kbits/s), and synthetic (< 4.0 kbits/s). • Signal-to-noise ratio (SNR) and segmental SNR (SEGSNR, computed by averaging the SNR over each N-point segment) • Perceptual criteria: the Diagnostic Rhyme Test (DRT), Diagnostic Acceptability Measure (DAM), and Mean Opinion Score (MOS) are based on listener ratings. • MOS: involves 12 to 24 listeners who are instructed to rate phonetically balanced records according to a 5-level quality scale. Excellent speech quality implies that coded speech is indistinguishable from the original and without perceptible noise.
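A minimal SEGSNR sketch following the definition above (the 160-sample segment length is an illustrative choice; segments with zero signal or zero error are skipped to keep the logarithm finite):

```python
import numpy as np

def segsnr(orig, coded, n=160):
    """Segmental SNR: average the per-segment SNR (in dB) over N-point segments."""
    snrs = []
    for i in range(0, len(orig) - n + 1, n):
        s = orig[i:i + n]
        e = s - coded[i:i + n]                 # reconstruction error in this segment
        p_sig, p_err = np.sum(s ** 2), np.sum(e ** 2)
        if p_sig > 0 and p_err > 0:            # skip silence / perfect segments
            snrs.append(10 * np.log10(p_sig / p_err))
    return np.mean(snrs)
```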

  15. [Figure: iLBC at 13.3 (or 15.2) kbits/s; used by Skype]

  16. Source: Speech Coding: A Tutorial Review

  17. Outline • Why digitize voice? • Introduction • Speech properties • Performance measurement • The channel vocoders and LPC-10 • Analysis-by-synthesis linear predictive coders and G.723.1 • Quality of service in voice over IP (VoIP) • iLBC

  18. The Channel Vocoder: the original • In the channel vocoder, each segment of input speech is analyzed using a bank of band-pass filters called the analysis filter bank. • The energy at the output of each filter is estimated at fixed intervals. • A decision is made as to whether the speech is voiced or unvoiced. • The period of the fundamental harmonic is called the pitch period. • The synthesized output matches the frequency profile of the input speech (a sketch of the analysis side follows).
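An illustrative sketch of the analysis side; the number of bands, the band edges, and the filter order are assumptions for the example, not the original vocoder's design:

```python
import numpy as np
from scipy.signal import butter, lfilter

def band_energies(x, fs=8000, n_bands=10):
    """Channel-vocoder-style analysis: estimate the energy at the output of
    each band-pass filter in the analysis bank."""
    edges = np.linspace(100, 3800, n_bands + 1)   # assumed band edges (Hz)
    energies = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        b, a = butter(4, [lo, hi], btype='bandpass', fs=fs)
        y = lfilter(b, a, x)                      # output of one analysis filter
        energies.append(np.mean(y ** 2))          # short-time energy estimate
    return np.array(energies)
```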

  19. The Channel Vocoder Receiver

  20. The Linear Predictive Coder • Instead of the vocal tract being modeled by a bank of filters, it is modeled as a single linear filter whose output y_n is related to the input ε_n by y_n = Σ_{i=1}^{M} a_i y_{n-i} + G ε_n, where G is called the gain of the filter. • The input to the vocal tract filter is either the output of a random noise generator or a periodic pulse generator.
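A minimal sketch of this two-source synthesis model (the coefficients passed in are illustrative; `lfilter` realizes the all-pole filter 1/A(z)):

```python
import numpy as np
from scipy.signal import lfilter

def synthesize(a, G, n_samples, pitch_period=None):
    """Drive the all-pole vocal-tract filter y_n = sum(a_i * y_{n-i}) + G*e_n
    with a periodic pulse train (voiced) or white noise (unvoiced)."""
    if pitch_period:                         # voiced: periodic pulse generator
        e = np.zeros(n_samples)
        e[::pitch_period] = 1.0
    else:                                    # unvoiced: random noise generator
        e = np.random.randn(n_samples)
    # 1/A(z) with A(z) = 1 - a_1 z^-1 - ... - a_M z^-M
    return lfilter([G], np.concatenate(([1.0], -np.asarray(a))), e)
```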

  21. The Model for Speech Synthesis

  22. LPC-10 • 8000 samples per second, 180-sample segments, corresponding to 22.5 ms. • The V/U decision: energy and the number of zero crossings. • The voicing decisions of neighboring frames are considered, to avoid isolated voiced frames. • Estimating the pitch period: the average magnitude difference function (AMDF), AMDF(P) = (1/N) Σ_i |y_i - y_{i-P}|

  23. [Figure: AMDF for the sound /e/ in "test" and for the sound /s/ in "test"]

  24. AMDF • In a voiced signal, not only do we have a minimum when P equals the pitch period, but the minimum value is also quite small compared with the average value. • We do not have to evaluate the AMDF for all possible values of P. Since the pitch period is between 2.5 and 19.5 ms, at 8000 samples per second P lies between 20 and 160 samples.
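Combining the AMDF definition with this restricted search range gives a minimal pitch estimator sketch:

```python
import numpy as np

def amdf_pitch(y, p_min=20, p_max=160):
    """Pick the pitch period as the lag P in [20, 160] samples
    (2.5-20 ms at 8 kHz) that minimizes the AMDF.
    Assumes len(y) > p_max so every lag can be evaluated."""
    def amdf(p):
        return np.mean(np.abs(y[p:] - y[:-p]))
    lags = np.arange(p_min, p_max + 1)
    vals = np.array([amdf(p) for p in lags])
    return lags[np.argmin(vals)]
```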

  25. Vocal Tract Filter (1) • In the analysis phase, the filter coefficients that best match the segment being analyzed in the mean squared error sense are calculated. • Setting the derivative of the expected squared prediction error with respect to each coefficient to zero yields M equations: Σ_{j=1}^{M} a_j E[y_{n-i} y_{n-j}] = E[y_n y_{n-i}], i = 1, ..., M (*)

  26. Vocal Tract Filter (2) • In order to solve (*), we need to be able to estimate E[y_{n-i} y_{n-j}]. Two methods are in use: autocorrelation and autocovariance. • In the autocorrelation approach, we assume that the {y_n} sequence is stationary, and therefore E[y_{n-i} y_{n-j}] = R_yy[|i-j|]. We also assume that the {y_n} sequence is zero outside the segment. Thus, the autocorrelation function is estimated as R_yy[k] = Σ_{n=k}^{N-1} y_n y_{n-k}, where N is the segment length.
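A compact estimator following this definition (N is the segment length; samples outside the segment are treated as zero):

```python
import numpy as np

def autocorr(y, M=10):
    """Estimate R_yy[0..M] for one analysis segment under the
    zero-outside-the-segment assumption of the autocorrelation method."""
    N = len(y)
    return np.array([np.dot(y[k:], y[:N - k]) for k in range(M + 1)])
```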

  27. Vocal Tract Filter (3) • The M equations can be written in matrix form as RA = P, where R is the M x M matrix with entries R_ij = R_yy[|i-j|], A = [a_1, a_2, ..., a_M]^T, and P = [R_yy[1], R_yy[2], ..., R_yy[M]]^T.

  28. Vocal Tract Filter (4) • The matrix equation can be solved directly to find the filter coefficients: A = R^{-1}P. • Note that R is Toeplitz, so we can obtain a recursive solution that is computationally very efficient; the Levinson-Durbin algorithm is the most widely used one. • The assumption of stationarity is not valid for speech signals. If we discard this assumption, the equations change: the term E[y_{n-i} y_{n-j}] is now a function of both i and j.
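Because R is Toeplitz, the normal equations can be solved in O(M^2) operations. A compact Levinson-Durbin sketch, taking the R_yy[0..M] values estimated as above (the indexing conventions are mine):

```python
import numpy as np

def levinson_durbin(R):
    """Solve the Toeplitz normal equations RA = P recursively.
    R holds R_yy[0..M]; returns the M filter coefficients a_1..a_M."""
    M = len(R) - 1
    a = np.zeros(M + 1)          # a[1..m] are the order-m coefficients
    err = R[0]                   # prediction error power
    for m in range(1, M + 1):
        k = (R[m] - np.dot(a[1:m], R[m - 1:0:-1])) / err   # reflection coefficient
        a_new = a.copy()
        a_new[m] = k
        a_new[1:m] = a[1:m] - k * a[m - 1:0:-1]            # update lower-order terms
        a, err = a_new, err * (1 - k * k)
    return a[1:]

# Usage: lpc = levinson_durbin(autocorr(segment, M=10))
```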

  29. Vocal Tract Filter (5) • The matrix equation becomes CA = S, where c_ij = E[y_{n-i} y_{n-j}]. • The elements are estimated over the segment as c_ij = Σ_{n=M}^{N-1} y_{n-i} y_{n-j}, with no assumption that samples outside the segment are zero. • Note that C is symmetric but no longer Toeplitz. The equations are generally solved by Cholesky decomposition. • LPC-10 uses the covariance method. If the first two coefficients both have very small values, the voicing decision is unvoiced.
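A sketch of the covariance method with a Cholesky solve (the 10th-order filter is an illustrative choice, and C is assumed positive definite, which holds for typical speech segments):

```python
import numpy as np

def covariance_lpc(y, M=10):
    """Covariance method: build c_ij = sum_n y[n-i]*y[n-j] over the segment
    and solve CA = S via Cholesky (C is symmetric but not Toeplitz)."""
    N = len(y)
    C = np.array([[np.dot(y[M - i:N - i], y[M - j:N - j]) for j in range(1, M + 1)]
                  for i in range(1, M + 1)])
    S = np.array([np.dot(y[M:N], y[M - i:N - i]) for i in range(1, M + 1)])
    L = np.linalg.cholesky(C)                            # C = L L^T
    return np.linalg.solve(L.T, np.linalg.solve(L, S))   # A = C^{-1} S
```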

  30. Transmitting the Parameters • voicing decision: 1 bit • pitch period: quantized to one of 60 different values using a log quantizer • vocal tract filter parameters: a 10th-order filter for voiced speech and a 4th-order filter for unvoiced speech • gain G: found from the root mean squared (rms) value of the segment and quantized using 5-bit log quantization • In total 54 bits per frame; at 8000/180 frames per second this gives 54 x (8000/180) = 2400 bits per second.

  31. Synthesis, LPC-10 • Voiced frames are generated by exciting the received vocal tract filter with a locally stored waveform. • This waveform is 40 samples long; it is truncated or padded with zeros depending on the pitch period. • If the frame is unvoiced, the vocal tract filter is excited by a pseudorandom number generator. • The use of only two kinds of excitation signals gives the voice an artificial quality. This approach also suffers when used in noisy environments.

  32. Excitation Signals • The most important factor in generating natural-sounding speech is the excitation signal. • Solutions: code-excited LP (CELP), the sinusoidal coders, the multi-pulse LP coder (MP-LPC), etc. • CELP makes use of a codebook of excitation signals. • The sinusoidal coders make use of an excitation signal that is the sum of sine waves of arbitrary amplitudes, frequencies, and phases. • Standards: CELP: FS 1016, G.728, G.729; MP-LPC: G.723.1; RPE-LTP: GSM

  33. G.728 • A CELP coder with a coder delay of 2 ms operating at 16 kbits/s. • To lower the coding delay, the size of each segment has to be reduced significantly; G.728 uses a segment of five samples. • G.728 does away with the pitch filter; instead it uses a 50th-order vocal tract filter. The algorithm obtains the vocal tract filter parameters in a backward-adaptive manner; they are updated every fourth frame (20 samples). • 10 bits encode the excitation for each five-sample segment: 3 bits encode the gain using a predictive encoding scheme, and 7 bits give the codebook index. At 8000 samples per second this yields (10/5) x 8000 = 16,000 bits per second.

  34. G.728 16 kbits/s Speech Encoder

  35. G.728 16 kbits/s Speech Decoder

  36. Mixed Excitation Linear Prediction (MELP) The excitation signal is no longer simply noise or a periodic pulse but a multi-band mixed excitation.

  37. Analysis-by-Synthesis vs. Analysis-and-Synthesis • AaS (open-loop) methods achieve acceptable quality at bit rates of 9.6-16 kbits/s; however, lacking a feedback control mechanism, they cannot avoid error propagation. • AbS (closed-loop analysis): the parameters are extracted and encoded by explicitly minimizing a measure of the difference between the original and the currently reconstructed speech. • AbS-LPC consists of a time-varying filter, excitation signal processing, and a perceptually based minimization mechanism. The time-varying filter is composed of the LPC filter (1/A(z)) and the long-term prediction filter (the adaptive codebook); the excitation signal processing makes use of a fixed codebook or multiple pulses. • The LTP captures the longer-term redundancy associated with the pitch frequency.

  38. Analysis-by-Synthesis LPC

  39. LTP Residual Signal

  40. [Figure]

  41. Variables • y[n]: source signal, 4 frames (16 sub-frames) • t[n]: perceptually weighted signal • a[n]: the output of the adaptive codebook (long-term prediction) • r[n]: t[n] - a[n], the residual signal to be matched by the fixed codebook

  42. Main Steps • Initialize the LPC filter and the long-term prediction filter with zeros or random numbers • Run LPC analysis on a source frame y[n] to obtain the LPC coefficients • Divide the frame into several sub-frames; for each sub-frame: • Compute the long-term prediction filter (LTP coefficients) in closed loop and obtain the residual signal r[n] = t[n] - a[n] • Find an approximate excitation signal for this residual, drawn from the fixed codebook (see the sketch below)
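The closed-loop search inside these steps can be sketched as follows. This is a toy illustration, not the G.723.1 procedure: the random codebook, the first-order filter, and plain squared error are stand-ins, and the target is assumed already perceptually weighted:

```python
import numpy as np
from scipy.signal import lfilter

def closed_loop_search(target, codebook, a):
    """Toy analysis-by-synthesis search: pass every candidate excitation
    through the LPC synthesis filter 1/A(z) and keep the entry (and optimal
    gain) minimizing the error against the weighted target."""
    den = np.concatenate(([1.0], -np.asarray(a)))        # A(z) coefficients
    best, best_err = None, np.inf
    for idx, code in enumerate(codebook):
        synth = lfilter([1.0], den, code)                # candidate synthetic speech
        gain = np.dot(target, synth) / np.dot(synth, synth)  # optimal gain for this entry
        err = np.sum((target - gain * synth) ** 2)
        if err < best_err:
            best, best_err = (idx, gain), err
    return best

# Usage: a 40-sample sub-frame against a random 128-entry codebook
rng = np.random.default_rng(0)
codebook = rng.standard_normal((128, 40))
target = rng.standard_normal(40)
print(closed_loop_search(target, codebook, a=[0.9]))
```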

  43. G.723.1 (high-rate) for Signal 'a'

  44. G.723.1 (high-rate) for Signal 'sh' • t[n] is similar to r[n]; in this case the excitation signal is generated almost entirely by the fixed codebook

  45. G.723.1 (high-rate) for signal ('he' to 'i' in 'she is')

  46. MP-LPC Approach • #(odd/even grid) x #(positions) x #(signs) x #(gains) = 2 x C(30,6) x 2^6 x 24 = 1,824,076,800 candidate excitations
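Reading the factors as 2 grids, C(30,6) position patterns, 2^6 sign patterns, and 24 gain levels, the count checks out exactly:

```python
from math import comb

# 2 grids x C(30, 6) position patterns x 2^6 sign patterns x 24 gain levels
print(2 * comb(30, 6) * 2**6 * 24)   # 1824076800
```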

  47. G.723.1 MP-MLQ • Sub-optimal pulse-by-pulse sequential search • Select each candidate pulse position using the cross-correlation between the residual signal and the impulse response • Use the first pulse to determine the possible gain values (-3.2 dB, +0 dB, +3.2 dB, +6.4 dB) • Only 8 candidate excitation signals are selected
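A sketch of such a greedy, pulse-by-pulse search; the grid handling and amplitude formula are simplifications of the standard's actual procedure, and h stands for the impulse response of the weighted synthesis filter:

```python
import numpy as np

def sequential_pulse_search(r, h, n_pulses=6, grid=2):
    """Place one pulse at a time where the cross-correlation between the
    remaining target and the impulse response h is largest, then subtract
    that pulse's contribution (a simplified sketch, not G.723.1 itself)."""
    target = np.array(r, dtype=float)
    h = np.asarray(h, dtype=float)
    positions, amps = [], []
    for _ in range(n_pulses):
        c = np.correlate(target, h, mode='full')[len(h) - 1:]  # correlation at lags 0..N-1
        c[np.arange(len(c)) % grid != 0] = 0      # restrict to one position grid
        pos = int(np.argmax(np.abs(c)))
        amp = c[pos] / np.dot(h, h)               # approximate optimal amplitude
        positions.append(pos)
        amps.append(amp)
        contrib = np.zeros_like(target)           # subtract this pulse's contribution
        contrib[pos:pos + len(h)] += amp * h[:len(target) - pos]
        target -= contrib
    return positions, amps
```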

  48. MP-MLQ Coding Result

  49. G.723.1 MP-MLQ

  50. G.723.1 MP-MLQ
