Speech Coding Techniques

Speech Coding Techniques 潘奕誠 4/7/2003

Introduction • Efficient speech-coding techniques • Advantages for VoIP • Digital streams of ones and zeros • The lower the bandwidth, the lower the quality • RTP payload types • Processing power • Better quality (for a given bandwidth) requires a more complex algorithm • A balance between quality and cost

Voice Quality • Bandwidth is easily quantified • Voice quality is subjective • MOS, Mean Opinion Score • ITU-T Recommendation P.800 • Excellent – 5 • Good – 4 • Fair – 3 • Poor – 2 • Bad – 1 • A minimum of 30 people • Listen to voice samples or in conversations

P.800 recommendations • The selection of participants • The test environment • Explanations to listeners • Analysis of results • Toll quality • A MOS of 4.0 or higher
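
As a rough illustration of how a MOS is obtained, the sketch below averages listener ratings on the five-point scale and compares the result with the toll-quality threshold of 4.0. The ratings are invented for the example, and the function names are hypothetical; P.800 itself specifies much more (participant selection, environment, analysis) than a plain average.

```python
from statistics import mean

# Hypothetical ratings from 30 listeners (5 = excellent ... 1 = bad).
# These numbers are made up purely for illustration.
ratings = [4, 5, 4, 3, 4, 4, 5, 3, 4, 4,
           4, 3, 5, 4, 4, 4, 3, 4, 5, 4,
           4, 4, 3, 4, 5, 4, 4, 3, 4, 4]

MIN_LISTENERS = 30          # "a minimum of 30 people"
TOLL_QUALITY_MOS = 4.0      # "a MOS of 4.0 or higher"


def mean_opinion_score(scores):
    """Return the average rating after basic sanity checks."""
    if len(scores) < MIN_LISTENERS:
        raise ValueError("need at least 30 listeners")
    if any(s not in (1, 2, 3, 4, 5) for s in scores):
        raise ValueError("ratings must be integers from 1 to 5")
    return mean(scores)


mos = mean_opinion_score(ratings)
is_toll_quality = mos >= TOLL_QUALITY_MOS
print(f"MOS = {mos:.2f}, toll quality: {is_toll_quality}")
```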

About Speech • Speech • Air pushed from the lungs past the vocal cords and along the vocal tract • The basic vibrations – vocal cords • The sound is altered by the disposition of the vocal tract (tongue and mouth) • Model the vocal tract as a filter • The shape changes relatively slowly • The vibrations at the vocal cords • The excitation signal

Speech sounds • Voiced sound • The vocal cords vibrate open and close • Quasi-periodic pulses of air • The rate of the opening and closing – the pitch • Unvoiced sounds • Forcing air at high velocities through a constriction • Noise-like turbulence • Show little long-term periodicity • Short-term correlations still present • Plosive sounds • A complete closure in the vocal tract • Air pressure is built up and released suddenly

Voice Sampling • Discrete-Time LTI Systems: The Convolution Sum • [Figure: stem plots of the input x[n], the impulse response h[n], and the output y[n] against n]
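
The convolution sum y[n] = Σ_k x[k] h[n − k] can be sketched directly in a few lines. The input values below are an assumption chosen to be consistent with the numbers that survive from the slide's figure (a two-sample input and a three-sample impulse response); the original plot itself is not recoverable.

```python
def convolve(x, h):
    """Convolution sum y[n] = sum_k x[k] * h[n - k] for finite sequences."""
    y = [0.0] * (len(x) + len(h) - 1)
    for n in range(len(y)):
        for k in range(len(x)):
            if 0 <= n - k < len(h):
                y[n] += x[k] * h[n - k]
    return y


# Assumed example values (not confirmed by the source).
x = [0.5, 2.0]          # x[0], x[1]
h = [1.0, 1.0, 1.0]     # h[0], h[1], h[2]

y = convolve(x, h)
print(y)  # [0.5, 2.5, 2.5, 2.0]
```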

Nyquist sampling theorem
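
A small sketch of what the theorem implies for telephone speech: with a sampling rate of 8 kHz, only components below 4 kHz are represented unambiguously, and a tone above that limit produces exactly the same samples as a lower-frequency alias. The helper name `alias_frequency` is hypothetical.

```python
import math


def alias_frequency(f_hz, fs_hz):
    """Frequency in [0, fs/2] at which a tone of f_hz appears after sampling."""
    f = f_hz % fs_hz
    return fs_hz - f if f > fs_hz / 2 else f


fs = 8000  # sampling rate in Hz
for f in (1000, 3000, 5000, 9000):
    print(f, "Hz is observed as", alias_frequency(f, fs), "Hz")

# A 5 kHz cosine and a 3 kHz cosine give identical samples at 8 kHz.
samples_5k = [math.cos(2 * math.pi * 5000 * n / fs) for n in range(16)]
samples_3k = [math.cos(2 * math.pi * 3000 * n / fs) for n in range(16)]
identical = all(abs(a - b) < 1e-9 for a, b in zip(samples_5k, samples_3k))
```

This is why the input is band-limited before sampling.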

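Quantization (Scalar Quantization) • Assume $|x[n]| \le A$ • Divide the range $[-A, A]$ into $L$ quantization levels $\{J_1, J_2, \dots, J_k, \dots, J_L\}$ • $J_k = [m_{k-1}, m_k]$, with $m_0 = -A$ and $m_L = A$ • $L = 2^R$ • Each quantization level $J_k$ is represented by a value $v_k$ • $S = \bigcup_k J_k$, $V = \{v_1, v_2, \dots, v_k, \dots, v_L\}$ • [Figure: axis from $m_0 = -A$ to $m_L = A$ showing boundaries $m_k$, intervals $J_k$, and representative values $v_k$]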
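
A minimal sketch of a uniform scalar quantizer with L = 2^R equal-width intervals on [−A, A]. Two assumptions are made that the slide does not state: the representative value v_k is taken as the midpoint of its interval, and the interval index is 0-based (so index k here corresponds to J_{k+1}).

```python
import math


def uniform_quantize(x, A, R):
    """Return (interval index, representative value) for a sample x."""
    L = 2 ** R                      # number of quantization levels
    step = 2.0 * A / L              # width of each interval
    k = math.floor((x + A) / step)  # 0-based interval index
    k = min(max(k, 0), L - 1)       # clamp samples at or beyond the edges
    v = -A + (k + 0.5) * step       # midpoint of the interval (assumption)
    return k, v


# Example: A = 1.0 and R = 3 bits give 8 levels of width 0.25.
print(uniform_quantize(0.3, 1.0, 3))   # (5, 0.375)
print(uniform_quantize(1.0, 1.0, 3))   # (7, 0.875)
print(uniform_quantize(-1.0, 1.0, 3))  # (0, -0.875)
```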

Non-Uniform Quantization • Concept: small quantization levels for small x, large quantization levels for large x • Goal: constant $\mathrm{SNR}_Q$ for all x • [Figure: axis from $m_0 = -A$ to $m_L = A$ with boundaries $m_1, m_2, \dots$ spaced more closely near 0]

Companding • x[n] → Compressor F(x) → Uniform Quantization → …1101… → Uniform Decoder → Expander $F^{-1}(x)$ → $\hat{x}[n]$ • Compressor + Expander = Compandor • F(x) specifies the non-uniform quantization characteristics

Non-Uniform Quantization • μ-law • A-law • Typical values in practice • μ = 255, A = 87.6
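
The compressor F(x) and expander F⁻¹(x) can be sketched with the continuous μ-law formula for inputs normalized to |x| ≤ 1. This is an assumption for illustration: practical G.711 encoders use a piecewise-linear approximation of this curve rather than evaluating logarithms.

```python
import math

MU = 255.0  # typical value in practice


def compress(x, mu=MU):
    """mu-law compressor F(x) for |x| <= 1."""
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)


def expand(y, mu=MU):
    """mu-law expander, the inverse of compress()."""
    return math.copysign(((1.0 + mu) ** abs(y) - 1.0) / mu, y)


# Small amplitudes are stretched over a much larger share of the range,
# so a uniform quantizer applied after F(x) resolves them more finely.
for x in (0.01, 0.1, 0.5, 1.0):
    y = compress(x)
    print(f"x = {x:<5} F(x) = {y:.3f}  round trip = {expand(y):.5f}")
```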

Types of Speech Codecs • Waveform codecs, source codecs (also known as vocoders), and hybrid codecs

Speech Source Model and Source Coding • Excitation • A periodic pulse train generator (voiced) or a random sequence generator (unvoiced), selected by v/u and scaled by the gain G, produces the excitation signal u[n] • Excitation parameters • v/u: voiced/unvoiced • N: pitch for voiced • G: signal gain • Vocal Tract Model • u[n] → G(z) → x[n] • $G(z) = \dfrac{1}{1 - \sum_{k=1}^{P} a_k z^{-k}}$, also written G(ω) in frequency and g[n] in time • Vocal tract parameters • {a_k}: LPC coefficients • Formant structure of speech signals • A good approximation, though not precise enough
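
The source model can be sketched as an excitation generator followed by the all-pole filter, which in the time domain is the recursion x[n] = G·u[n] + Σ_k a_k x[n − k]. The function names, the unit-height pulses, and the Gaussian noise source are assumptions made for this illustration.

```python
import random


def excitation(voiced, pitch_period, n_samples, seed=0):
    """Pulse train with period N for voiced sounds, random noise otherwise."""
    if voiced:
        return [1.0 if n % pitch_period == 0 else 0.0 for n in range(n_samples)]
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n_samples)]


def synthesize(a, gain, u):
    """All-pole vocal tract filter: x[n] = gain*u[n] + sum_k a[k-1]*x[n-k]."""
    x = []
    for n, u_n in enumerate(u):
        acc = gain * u_n
        for k, a_k in enumerate(a, start=1):
            if n - k >= 0:
                acc += a_k * x[n - k]
        x.append(acc)
    return x


# A single pulse through a one-coefficient filter decays geometrically.
print(synthesize([0.5], 1.0, [1.0, 0.0, 0.0, 0.0]))  # [1.0, 0.5, 0.25, 0.125]

# Voiced frame: pitch period of 4 samples, hypothetical coefficients.
voiced_frame = synthesize([0.5, -0.25], 2.0, excitation(True, 4, 8))
```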

LPC Vocoder (Voice Coder) • Transmitter: x[n] → LPC Analysis → {a_k}, N, G, v/u → Encoder → …11011… • Receiver: …11011… → Decoder → {a_k}, N, G, v/u → Excitation → G(z), g[n] → x[n] • N by pitch detection • v/u by voicing detection • {a_k} can be non-uniform or vector quantized to reduce bit rate further
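
The slide does not say how the LPC analysis block obtains {a_k}. One common approach, assumed here, is the autocorrelation method solved with the Levinson-Durbin recursion, using the predictor convention x̂[n] = Σ_k a_k x[n − k]. Windowing, pitch detection, and voicing detection are left out of this sketch.

```python
def autocorrelation(x, max_lag):
    """r[k] = sum_n x[n] * x[n - k] for k = 0..max_lag."""
    return [sum(x[n] * x[n - k] for n in range(k, len(x)))
            for k in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve for predictor coefficients a_1..a_P; assumes r[0] > 0."""
    a = [0.0] * (order + 1)          # a[0] is unused
    error = r[0]                     # prediction error energy
    for i in range(1, order + 1):
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / error              # reflection coefficient
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        error *= (1.0 - k * k)
    return a[1:], error


# Autocorrelation of a first-order process with coefficient 0.5:
coeffs, err = levinson_durbin([1.0, 0.5, 0.25], order=2)
print(coeffs, err)  # [0.5, 0.0] 0.75
```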

G.711 • The most commonplace codec • Used in circuit-switched telephone network • PCM, Pulse-Code Modulation • If uniform quantization • 12 bits * 8 k samples/sec = 96 kbps • Non-uniform quantization • 8 bits * 8 k samples/sec = 64 kbps, the DS0 rate • μ-law • North America • A-law • Other countries, a little friendlier to lower signal levels • An MOS of about 4.3
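
The bit-rate figures follow directly from bits per sample times the 8 kHz sampling rate, as this short calculation shows (8 bits per sample is the companded G.711 word size; the function name is just for illustration).

```python
SAMPLE_RATE_HZ = 8000  # telephone-band sampling rate


def pcm_bit_rate(bits_per_sample, sample_rate_hz=SAMPLE_RATE_HZ):
    """Bit rate in bits per second of a PCM stream."""
    return bits_per_sample * sample_rate_hz


uniform_bps = pcm_bit_rate(12)    # uniform quantization
companded_bps = pcm_bit_rate(8)   # non-uniform (mu-law or A-law)

print(uniform_bps // 1000, "kbps uniform")      # 96 kbps uniform
print(companded_bps // 1000, "kbps companded")  # 64 kbps companded
```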

ADPCM (Adaptive Differential PCM) • DPCM and ADPCM • ADPCM: adaptive prediction in DPCM, plus adaptive quantization • Adaptive quantization • Quantization step size Δ varies with local signal level • Δ[n] = a σ_x[n] • σ_x[n]: locally estimated standard deviation of x[n] • G.721: ADPCM-coded speech at 32 kbps • G.726 (A-law or μ-law) • 16, 24, 32, 40 kbps • MOS 4.0 at 32 kbps
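
A sketch of the adaptive quantization idea, where the step size tracks a local estimate of the signal's standard deviation. The sliding-window RMS estimator, the constant a = 0.5, and the small floor value are assumptions for this example. Real ADPCM coders such as G.726 adapt from previously quantized values so that the decoder can follow without side information.

```python
from math import sqrt


def adaptive_step_sizes(x, a=0.5, window=4, floor=1e-3):
    """Step size per sample: a times the RMS of the most recent samples."""
    steps = []
    for n in range(len(x)):
        recent = x[max(0, n - window + 1): n + 1]
        sigma = sqrt(sum(s * s for s in recent) / len(recent))  # zero-mean assumed
        steps.append(max(a * sigma, floor))
    return steps


# A quiet passage followed by a loud one: the step size grows with the level.
signal = [0.1, -0.1, 0.1, -0.1, 1.0, -1.0, 1.0, -1.0]
steps = adaptive_step_sizes(signal)
print([round(s, 3) for s in steps])
```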

Analysis-by-Synthesis (AbS) Codecs • Hybrid codec • Fill the gap between waveform and source codecs • The most successful and commonly used • Time-domain AbS codecs • Not a simple two-state, voiced/unvoiced • Different excitation signals are attempted • Closest to the original waveform is selected • MPE, Multi-Pulse Excited • RPE, Regular-Pulse Excited • CELP, Code-Excited Linear Predictive
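
The analysis-by-synthesis loop can be sketched as follows: each candidate excitation is passed through the synthesis filter, and the candidate whose output is closest to the original waveform wins. The tiny codebook and filter coefficient are hypothetical, and plain squared error is used here, whereas real codecs apply a perceptual weighting and also optimize a gain.

```python
def synthesize(a, u):
    """All-pole synthesis filter: y[n] = u[n] + sum_k a[k-1] * y[n-k]."""
    y = []
    for n, u_n in enumerate(u):
        acc = u_n
        for k, a_k in enumerate(a, start=1):
            if n - k >= 0:
                acc += a_k * y[n - k]
        y.append(acc)
    return y


def abs_search(target, codebook, a):
    """Return (index, error) of the excitation that best matches target."""
    best_index, best_error = None, float("inf")
    for index, candidate in enumerate(codebook):
        output = synthesize(a, candidate)
        error = sum((t - s) ** 2 for t, s in zip(target, output))
        if error < best_error:
            best_index, best_error = index, error
    return best_index, best_error


# Hypothetical three-entry codebook and one-coefficient filter.
codebook = [[1.0, 0.0, 0.0, 0.0],
            [0.0, 1.0, 0.0, 0.0],
            [1.0, 0.0, 1.0, 0.0]]
a = [0.5]
target = synthesize(a, codebook[2])      # pretend this is the input speech
print(abs_search(target, codebook, a))   # (2, 0.0)
```

Only the winning index, not the excitation waveform itself, has to be sent to the decoder.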

G.728 LD-CELP • CELP codecs • A filter; its characteristics change over time • A codebook of acoustic vectors • A vector = a set of elements representing various characteristics of the excitation • Transmit • Filter coefficients, gain, a pointer to the vector chosen • Low Delay CELP • Backward-adaptive coder • Use previous samples to determine filter coefficients • Operates on five samples at a time • Delay < 1 ms • Only the pointer is transmitted

1024 vectors in the codebook • 10-bit pointer (index) • 16 kbps • LD-CELP encoder • Minimize a frequency-weighted mean-square error

LD-CELP decoder • An MOS of about 3.9 • One-quarter of G.711 bandwidth
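
The LD-CELP numbers can be checked with a short calculation: a 10-bit index sent for every block of five samples at 8 kHz gives 16 kbps, a block duration under 1 ms, and one-quarter of the 64 kbps G.711 rate. The variable names are for illustration only.

```python
SAMPLE_RATE_HZ = 8000
CODEBOOK_SIZE = 1024
SAMPLES_PER_VECTOR = 5
G711_BPS = 64000

index_bits = (CODEBOOK_SIZE - 1).bit_length()              # bits to address 1024 entries
vectors_per_second = SAMPLE_RATE_HZ / SAMPLES_PER_VECTOR   # blocks sent per second
bit_rate_bps = index_bits * vectors_per_second
block_ms = 1000.0 * SAMPLES_PER_VECTOR / SAMPLE_RATE_HZ    # duration of one block
fraction_of_g711 = bit_rate_bps / G711_BPS

print(index_bits, "bits per index")        # 10 bits per index
print(bit_rate_bps / 1000, "kbps")         # 16.0 kbps
print(block_ms, "ms per block")            # 0.625 ms per block
print(fraction_of_g711, "of G.711 rate")   # 0.25 of G.711 rate
```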

G.723.1 ACELP • 6.3 or 5.3 kbps • Both mandatory • Can change from one to another during a conversation • The coder • A band-limited input speech signal • Sampled at 8 kHz, 16-bit uniform PCM quantization • Operates on blocks of 240 samples at a time • A look-ahead of 7.5 ms • A total algorithmic delay of 37.5 ms + other delays • A high-pass filter to remove any DC component

G.723.1 Annex A • Silence Insertion Descriptor (SID) frames of size four octets • The two LSBs of the first octet indicate the frame type • 00: 6.3 kbps, 24 octets/frame • 01: 5.3 kbps, 20 octets/frame • 10: SID frame, 4 octets/frame • An MOS of about 3.8 • At least 37.5 ms delay
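
A receiver can tell the frame type, and therefore the frame length, from the two least significant bits of the first octet. The sketch below encodes only the three combinations listed above; the function name is hypothetical, and the value 11 is returned as `None` because the slide does not describe it.

```python
# Two LSBs of the first octet -> (frame type, frame size in octets)
FRAME_TYPES = {
    0b00: ("6.3 kbps", 24),
    0b01: ("5.3 kbps", 20),
    0b10: ("SID", 4),
}


def classify_frame(first_octet):
    """Return (type, size in octets), or None for a combination not listed."""
    return FRAME_TYPES.get(first_octet & 0b11)


print(classify_frame(0xFC))  # ('6.3 kbps', 24)
print(classify_frame(0x01))  # ('5.3 kbps', 20)
print(classify_frame(0x06))  # ('SID', 4)
```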

G.729 • 8 kbps • Input frames of 10 ms, 80 samples for 8 kHz sampling rate • 5 ms look-ahead • Algorithmic delay of 15 ms • An 80-bit frame for 10 ms of speech • A complex codec • G.729A (Annex A), a number of simplifications • Same frame structure • Encoder/decoder can be mixed between G.729 and G.729A • Slightly lower quality
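
The delay and rate figures for G.729, and for G.723.1 earlier, follow from frame size plus look-ahead. A short calculation under the 8 kHz sampling rate (helper names are illustrative):

```python
def frame_ms(samples, fs_hz=8000):
    """Duration of one frame in milliseconds."""
    return 1000.0 * samples / fs_hz


def algorithmic_delay_ms(samples, lookahead_ms, fs_hz=8000):
    """One full frame must be buffered, plus the look-ahead."""
    return frame_ms(samples, fs_hz) + lookahead_ms


def bit_rate_bps(bits_per_frame, samples, fs_hz=8000):
    """Bits per frame divided by the frame duration in seconds."""
    return bits_per_frame * fs_hz / samples


g729_delay = algorithmic_delay_ms(80, 5.0)      # 10 ms frame + 5 ms look-ahead
g7231_delay = algorithmic_delay_ms(240, 7.5)    # 30 ms frame + 7.5 ms look-ahead
g729_rate = bit_rate_bps(80, 80)                # 80 bits every 10 ms

print(g729_delay, "ms")    # 15.0 ms
print(g7231_delay, "ms")   # 37.5 ms
print(g729_rate, "bps")    # 8000.0 bps
```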

G.729B • VAD, Voice Activity Detection • Based on analysis of several parameters of the input • The current frame plus two preceding frames • DTX, Discontinuous Transmission • Send nothing or send an SID frame • SID frame contains information to generate comfort noise • CNG, Comfort Noise Generation • G.729, an MOS of about 4.0 • G.729A, an MOS of about 3.7
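
A deliberately simplified sketch of how VAD and DTX fit together. It is not the G.729 Annex B algorithm, which analyses several parameters of the input; here the only feature is frame energy, the threshold is arbitrary, and a frame counts as speech if the current frame or either of the two preceding frames is active. The first silent frame is sent as an SID frame and later silent frames send nothing.

```python
def frame_energy(frame):
    """Mean squared amplitude of one frame."""
    return sum(s * s for s in frame) / len(frame)


def dtx_decisions(frames, threshold=0.01):
    """Return 'speech', 'SID' or 'none' for each frame (toy energy-based VAD)."""
    decisions = []
    active_history = []
    sid_sent = False
    for frame in frames:
        active_history.append(frame_energy(frame) > threshold)
        if any(active_history[-3:]):      # current frame plus two preceding
            decisions.append("speech")
            sid_sent = False
        elif not sid_sent:
            decisions.append("SID")       # lets the receiver generate comfort noise
            sid_sent = True
        else:
            decisions.append("none")
    return decisions


loud = [0.5] * 80      # hypothetical 10 ms frames of 80 samples
quiet = [0.001] * 80
decisions = dtx_decisions([loud] + [quiet] * 5 + [loud])
print(decisions)
# ['speech', 'speech', 'speech', 'SID', 'none', 'none', 'speech']
```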

Other Codecs • CDMA QCELP defined in IS-733 • Variable-rate coder • Two most common rates • The high rate, 13.3 kbps • A lower rate, 6.2 kbps • Silence suppression • For use with RTP, RFC 2658

GSM Enhanced Full-Rate (EFR) • GSM 06.60 • An enhanced version of GSM Full-Rate • ACELP-based codec • The same bit rate and the same overall packing structure • 12.2 kbps • Support discontinuous transmission • For use with RTP, RFC 1890

GSM Adaptive Multi-Rate (AMR) codec • GSM 06.90 • Eight different modes • 4.75 kbps to 12.2 kbps • 12.2 kbps, GSM EFR • 7.4 kbps, IS-641 (TDMA cellular systems) • Change the mode at any time • Offer discontinuous transmission • The coding choice of many 3G wireless networks

The MOS values are for laboratory conditions • G.711 does not deal with lost packets • G.729 can accommodate a lost frame by interpolating from previous frames • But cause errors in subsequent speech frames • Processing Power • G.728 or G.729, 40 MIPS • G.726 10 MIPS

Cascaded Codecs • E.g., G.711 stream -> G.729 encoder/decoder • Might not even come close to G.729 quality • Each coder only generates an approximation of the incoming signal

Tones, Signals, and DTMF Digits • The hybrid codecs are optimized for human speech • Other data may need to be transmitted • Tones: fax tones, dialing tone, busy tone • DTMF digits for two-stage dialing or voice-mail • G.711 is OK • G.723.1 and G.729 can be unintelligible • The ingress gateway needs to intercept • The tones and DTMF digits • Use an external signaling system

Easy at the start of a call • Difficult in the middle of a call • Encode the tones differently from the speech • Send them along the same media path • An RTP packet provides the name of the tone and the duration • Or, a dynamic RTP profile; an RTP packet containing the frequency, volume and the duration • RFC 2198 • An RTP payload format for redundant audio data • Sending both types of RTP payload

RTP Payload Format for DTMF Digits • An Internet Draft • Both methods described before • A large number of tones and events • DTMF digits, a busy tone, a congestion tone, a ringing tone, etc. • The named events • E: the end of the tone, R: reserved

Payload format
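
The figure for this slide did not survive, so the sketch below assumes the named-event layout that was later published as RFC 2833: an 8-bit event code, the E (end) bit, the R (reserved) bit, a 6-bit volume, and a 16-bit duration in timestamp units. The function name and the sample payload are hypothetical.

```python
import struct


def parse_named_event(payload):
    """Decode the first four octets of a named-event payload (assumed layout)."""
    event, flags, duration = struct.unpack("!BBH", payload[:4])
    return {
        "event": event,                  # 0-9 are the DTMF digits 0-9
        "end": bool(flags & 0x80),       # E: the end of the tone
        "reserved": bool(flags & 0x40),  # R: reserved
        "volume": flags & 0x3F,          # 6-bit volume field
        "duration": duration,            # in RTP timestamp units
    }


# Hypothetical payload: digit 5, end bit set, volume 10, duration 800 units
# (100 ms at an 8 kHz timestamp clock).
packet = bytes([5, 0x8A, 0x03, 0x20])
fields = parse_named_event(packet)
print(fields)
```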

Finis

Frequency-Domain Representation of Sampling

Speech Source Model and Source Coding • Vocal Tract Model
