Speech-Coding Techniques

Speech-Coding Techniques Chapter 3

Introduction • Efficient speech-coding techniques • Advantages for VoIP • Digital streams of ones and zeros • The lower the bandwidth, the lower the quality • RTP payload types • Processing power • Better quality for a given bandwidth requires a more complex algorithm • A balance between quality and cost

Voice Quality • Bandwidth is easily quantified • Voice quality is subjective • MOS, Mean Opinion Score • ITU-T Recommendation P.800 • Excellent – 5 • Good – 4 • Fair – 3 • Poor – 2 • Bad – 1 • A minimum of 30 people • Listen to voice samples or in conversations

P.800 recommendations • The selection of participants • The test environment • Explanations to listeners • Analysis of results • Toll quality • A MOS of 4.0 or higher
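
To make the MOS arithmetic concrete, here is a minimal sketch that averages category ratings on the P.800 scale and checks the result against the toll-quality threshold. The panel data and the function name are made up for illustration; P.800 itself also governs participant selection, the test environment, and the analysis, none of which is modeled here.

```python
from statistics import mean

# Five-point opinion scale as listed on the Voice Quality slide.
SCALE = {"Excellent": 5, "Good": 4, "Fair": 3, "Poor": 2, "Bad": 1}
TOLL_QUALITY_MOS = 4.0  # threshold quoted on the slide


def mean_opinion_score(ratings):
    """Average a list of category names into a MOS."""
    if len(ratings) < 30:
        raise ValueError("the slide calls for a minimum of 30 listeners")
    return mean(SCALE[r] for r in ratings)


# Hypothetical panel of 30 listeners (invented data).
panel = ["Excellent"] * 10 + ["Good"] * 14 + ["Fair"] * 6
mos = mean_opinion_score(panel)
is_toll_quality = mos >= TOLL_QUALITY_MOS
```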

Subjective and objective quality-testing techniques • PSQM – Perceptual Speech Quality Measurement • ITU-T P.861 • Faithfully represents human judgement and perception • Algorithmic comparison between the output signal and a known input • Type of speaker, loudness, delay, active/silence frames, clipping, environmental noise

A Little About Speech • Speech • Air pushed from the lungs past the vocal cords and along the vocal tract • The basic vibrations – vocal cords • The sound is altered by the disposition of the vocal tract (tongue and mouth) • Model the vocal tract as a filter • The shape changes relatively slowly • The vibrations at the vocal cords • The excitation signal

Speech sounds • Voiced sound • The vocal cords vibrate open and closed • Interrupt the air flow • Quasi-periodic pulses of air • The rate of the opening and closing – the pitch • A high degree of periodicity at the pitch period • 2-20 ms

Voiced speech Power spectrum density

Unvoiced sounds • Forcing air at high velocities through a constriction • The glottis is held open • Noise-like turbulence • Show little long-term periodicity • Short-term correlations still present

unvoiced speech Power spectrum density

Plosive sounds • A complete closure in the vocal tract • Air pressure is built up and released suddenly • A vast array of sounds • The speech signal is relatively predictable over time • The reduction of transmission bandwidth can be significant

Voice Sampling • A-to-D conversion • Take discrete samples of the waveform and represent each sample by some number of bits • A signal can be reconstructed if it is sampled at a minimum of twice its maximum frequency • Human speech • 300-3800 Hz • 8000 samples per second • Each sample is encoded into an 8-bit PCM code word (e.g. 01100101) • 8000 x 8 = 64,000 bit/s
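
The sampling arithmetic on this slide can be checked in a few lines. This is only a sketch; the constant names are chosen here for illustration.

```python
MAX_SPEECH_HZ = 3800     # upper edge of the speech band quoted on the slide
SAMPLE_RATE_HZ = 8000    # samples per second
BITS_PER_SAMPLE = 8      # one PCM code word per sample

# Sampling theorem: sample at no less than twice the highest frequency.
nyquist_min_rate = 2 * MAX_SPEECH_HZ
rate_is_sufficient = SAMPLE_RATE_HZ >= nyquist_min_rate

# Resulting PCM bit rate in bit/s.
pcm_bit_rate = SAMPLE_RATE_HZ * BITS_PER_SAMPLE
```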

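Quantization • How many bits are used to represent each sample • Quantization noise • The difference between the actual level of the input analog signal and the quantized level • More bits reduce the noise • Diminishing returns • Uniform quantization levels • Louder talkers sound better • E.g. 11.2 quantized to 11 vs. 2.2 quantized to 2: the same absolute error is a much larger relative error at the low level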

Non-uniform quantization • Smaller quantization steps at smaller signal levels • Spread signal-to-noise ratio more evenly
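
The effect of non-uniform steps can be seen by comparing an 8-bit uniform quantizer with an 8-bit quantizer wrapped in mu-law companding. This sketch uses the continuous mu-law curve with mu = 255, not the segmented tables that G.711 actually specifies; the test signal and all function names are assumptions made for illustration.

```python
import math

MU = 255.0  # mu-law constant (continuous approximation of the G.711 curve)


def compress(x):
    """Mu-law compression of x in [-1, 1]."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)


def expand(y):
    """Inverse of compress()."""
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)


def quantize_uniform(x, bits=8):
    """Round x in [-1, 1] to evenly spaced levels."""
    half = 2 ** (bits - 1) - 1
    return round(x * half) / half


def quantize_mulaw(x, bits=8):
    """Uniform quantization in the compressed domain: small steps near zero."""
    return expand(quantize_uniform(compress(x), bits))


def snr_db(quantizer, amplitude, n=2000):
    """Signal-to-quantization-noise ratio for a sine of the given amplitude."""
    signal = [amplitude * math.sin(2 * math.pi * 0.01234 * k) for k in range(n)]
    p_signal = sum(s * s for s in signal)
    p_noise = sum((s - quantizer(s)) ** 2 for s in signal)
    return 10 * math.log10(p_signal / p_noise)
```

Calling `snr_db` with amplitudes such as 0.9 (loud talker) and 0.01 (quiet talker) shows the uniform quantizer's SNR collapsing at the low level while the companded quantizer stays much more even.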

DTX and Comfort Noise • DTX is Discontinuous Transmission • A voice activity detector (VAD) detects whether there is active speech • When there is no active speech, different DTX procedures can be used: • No transmission at all • Comfort Noise (CN) using RFC 3389 • Codec built-in CN, such as the AMR SID (Silence Descriptor) • The frequency of Comfort Noise packets varies but is usually some fraction of the normal packet rate
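
A rough sketch of how a VAD decision can drive DTX: each frame is classified by energy, and silent stretches produce an occasional SID frame instead of speech packets. The energy threshold, the SID interval, and the function names are illustrative assumptions; real detectors use several signal parameters rather than energy alone.

```python
def frame_energy(frame):
    """Mean squared amplitude of one frame of samples."""
    return sum(s * s for s in frame) / len(frame)


def dtx_schedule(frames, threshold=1e-4, sid_interval=8):
    """Return 'SPEECH', 'SID', or 'NONE' for each frame.

    threshold and sid_interval are hypothetical tuning values.
    """
    decisions = []
    silent_run = 0
    for frame in frames:
        if frame_energy(frame) >= threshold:
            silent_run = 0
            decisions.append("SPEECH")
        else:
            # The first silent frame and every sid_interval-th one after it
            # carry a comfort-noise update; the rest send nothing.
            decisions.append("SID" if silent_run % sid_interval == 0 else "NONE")
            silent_run += 1
    return decisions
```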

Types of Speech Coders • Waveform codecs • Sample and code • High quality and not complex • Large amount of bandwidth • Source codecs (vocoders) • Match the incoming signal to a mathematical model • Linear-predictive filter model of the vocal tract • A voiced/unvoiced flag for the excitation • The model parameters are sent rather than the signal • Low bit rates, but sounds synthetic • Higher bit rates do not improve quality much

Hybrid codecs • Attempt to provide the best of both • Perform a degree of waveform matching • Utilize the sound production model • Quite good quality at low bit rate

G.711 • The most commonplace codec • Used in circuit-switched telephone network • PCM, Pulse-Code Modulation • If uniform quantization • 12 bits * 8 k/sec = 96 kbps • Non-uniform quantization • 64 kbps DS0 rate • mu-law • North America • A-law • Other countries, a little friendlier to lower signal levels • An MOS of about 4.3

DPCM • DPCM, Differential PCM • Only transmit the difference between the predicted value and the actual value • Voice changes relatively slowly • It is possible to predict the value of a sample based on the values of previous samples • The receiver performs the same prediction • The simplest form • No prediction • No algorithmic delay
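
As a minimal sketch of the idea, the encoder below predicts each sample as the previously reconstructed sample and transmits only the quantized difference; the decoder runs the same prediction. The fixed step size and the function names are assumptions for illustration, not part of any standard.

```python
def dpcm_encode(samples, step=4):
    """Transmit quantized differences from the previous reconstructed sample."""
    codes = []
    predicted = 0
    for s in samples:
        code = round((s - predicted) / step)
        codes.append(code)
        # Track what the decoder will reconstruct so errors do not accumulate.
        predicted += code * step
    return codes


def dpcm_decode(codes, step=4):
    """Rebuild the signal by running the same prediction as the encoder."""
    out = []
    predicted = 0
    for code in codes:
        predicted += code * step
        out.append(predicted)
    return out
```

Because the encoder predicts from the decoder's reconstruction rather than from the original samples, the per-sample error stays within half a step.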

ADPCM • ADPCM, Adaptive DPCM • Predicts sample values based on • Past samples • Factoring in some knowledge of how speech varies over time • The error is quantized and transmitted • Fewer bits required • G.721 • 32 kbps • G.726 • A-law/mu-law PCM -> 16, 24, 32, 40 kbps • An MOS of about 4.0 at 32 kbps
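
The adaptive part can be illustrated with a toy coder whose quantizer step grows when the 4-bit code saturates and shrinks when the code is small. This is not the G.726 algorithm: the adaptation rule, its constants, and the function names are all assumptions chosen to show the principle. At 4 bits per sample and 8000 samples per second the output rate is 32 kbps.

```python
def _adapt(step, code, levels):
    """Toy step-size rule shared by encoder and decoder (hypothetical)."""
    if abs(code) >= levels:          # saturated: signal is moving fast
        return min(step * 2.0, 2048.0)
    if abs(code) <= levels // 2:     # small code: signal is moving slowly
        return max(step * 0.75, 1.0)
    return step


def adpcm_encode(samples, bits=4):
    levels = 2 ** (bits - 1) - 1     # codes lie in [-levels, levels]
    step, predicted, codes = 16.0, 0.0, []
    for s in samples:
        code = max(-levels, min(levels, round((s - predicted) / step)))
        codes.append(code)
        predicted += code * step     # mirror the decoder's reconstruction
        step = _adapt(step, code, levels)
    return codes


def adpcm_decode(codes, bits=4):
    levels = 2 ** (bits - 1) - 1
    step, predicted, out = 16.0, 0.0, []
    for code in codes:
        predicted += code * step
        out.append(predicted)
        step = _adapt(step, code, levels)
    return out
```

Only the codes are transmitted; the decoder derives the same step sizes from them, so no side information is needed.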

Analysis-by-Synthesis (AbS) Codecs • Hybrid codec • Fill the gap between waveform and source codecs • The most successful and commonly used • Time-domain AbS codecs • Not a simple two-state, voiced/unvoiced • Different excitation signals are attempted • Closest to the original waveform is selected • MPE, Multi-Pulse Excited • RPE, Regular-Pulse Excited • CELP, Code-Excited Linear Predictive

G.728 LD-CELP • CELP codecs • A filter; its characteristics change over time • A codebook of acoustic vectors • A vector = a set of elements representing various characteristics of the excitation • Transmit • Filter coefficients, gain, a pointer to the vector chosen • Low Delay CELP • Backward-adaptive coder • Use previous samples to determine filter coefficients • Operates on five samples at a time • Delay < 1 ms • Only the pointer is transmitted

1024 vectors in the code book • 10-bit pointer (index) • 16 kbps • LD-CELP encoder • Minimize a frequency-weighted mean-square error
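
The analysis-by-synthesis search and the 16 kbps figure can be sketched as follows. The codebook contents here are random numbers, and the search minimizes a plain squared error on the raw vector; the real coder passes each candidate through the synthesis filter with a gain and minimizes a frequency-weighted error, which this sketch leaves out. All names are illustrative.

```python
import random

VECTOR_LEN = 5         # samples per vector (from the slide)
CODEBOOK_SIZE = 1024   # vectors in the codebook (from the slide)
SAMPLE_RATE_HZ = 8000

rng = random.Random(0)  # fixed seed; the codebook contents are made up
codebook = [[rng.uniform(-1, 1) for _ in range(VECTOR_LEN)]
            for _ in range(CODEBOOK_SIZE)]


def search(target):
    """Return the index of the codebook vector closest to target."""
    def squared_error(i):
        return sum((c - t) ** 2 for c, t in zip(codebook[i], target))
    return min(range(CODEBOOK_SIZE), key=squared_error)


# Only the index is sent: 10 bits for every 5 samples.
index_bits = (CODEBOOK_SIZE - 1).bit_length()
bit_rate = index_bits * SAMPLE_RATE_HZ // VECTOR_LEN
```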

LD-CELP decoder • An MOS of about 3.9 • One-quarter of G.711 bandwidth

G.723.1 ACELP • 6.3 or 5.3 kbps • Both mandatory • Can change from one to the other during a conversation • The coder • A band-limited input speech signal • Sampled at 8 kHz, 16-bit uniform PCM quantization • Operates on blocks of 240 samples (30 ms) at a time • A look-ahead of 7.5 ms • A total algorithmic delay of 37.5 ms + other delays • A high-pass filter to remove any DC component

Various operations to determine the appropriate filter coefficients • 5.3 kbps, Algebraic Code-Excited Linear Prediction • 6.3 kbps, Multi-pulse Maximum Likelihood Quantization • The transmission • Linear prediction coefficients • Gain parameters • Excitation codebook index • 24-octet frames at 6.3 kbps, 20-octet frames at 5.3 kbps

G.723.1 Annex A • Silence Insertion Descriptor (SID) frames of size four octets • The two LSBs of the first octet give the frame type • 00: 6.3 kbps, 24 octets/frame • 01: 5.3 kbps, 20 octets/frame • 10: SID frame, 4 octets/frame • An MOS of about 3.8 • At least 37.5 ms delay
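
The frame-type bits and the delay and rate figures for G.723.1 can be tied together in a short sketch. The dictionary and function names are invented for illustration; the bit values, frame sizes, block length, and look-ahead come from the slides.

```python
SAMPLE_RATE_HZ = 8000
BLOCK_SAMPLES = 240
LOOKAHEAD_MS = 7.5

FRAME_MS = BLOCK_SAMPLES * 1000 / SAMPLE_RATE_HZ    # 30 ms per frame
algorithmic_delay_ms = FRAME_MS + LOOKAHEAD_MS      # frame plus look-ahead

# Two least-significant bits of the first octet -> (description, octets).
FRAME_TYPES = {
    0b00: ("6.3 kbps speech", 24),
    0b01: ("5.3 kbps speech", 20),
    0b10: ("SID", 4),
}


def frame_info(first_octet):
    """Classify a frame from its first octet; 0b11 raises KeyError."""
    return FRAME_TYPES[first_octet & 0b11]


def packed_bit_rate(octets):
    """Bit rate in bit/s implied by one frame of this size every 30 ms."""
    return octets * 8 * 1000 / FRAME_MS
```

The packed rates work out to 6400 bit/s and about 5333 bit/s, slightly above the nominal 6.3 and 5.3 kbps figures.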

G.729 • 8 kbps • Input frames of 10 ms, 80 samples at an 8 kHz sampling rate • 5 ms look-ahead • Algorithmic delay of 15 ms • An 80-bit frame for 10 ms of speech • A complex codec • G.729A (Annex A), a number of simplifications • Same frame structure • A G.729 encoder works with a G.729A decoder and vice versa • Slightly lower quality

G.729B • VAD, Voice Activity Detection • Based on analysis of several parameters of the input • The current frame plus the two preceding frames • DTX, Discontinuous Transmission • Send nothing or send an SID frame • The SID frame contains information to generate comfort noise • CNG, Comfort Noise Generation • G.729, an MOS of about 4.0 • G.729A, an MOS of about 3.7

G.729 Annex D • A lower-rate extension • 6.4 kbps; 10 ms speech frames, 64 bits/frame • MOS  6.3 kbps G.723.1 • G.729 Annex E • A higher-bit-rate enhancement • The linear prediction filter of G.729 has 10 coefficients • That of G.729 Annex E has 30 coefficients • The codebook of G.729 has 35 bits • That of G.729 Annex E has 44 bits • 118 bits/frame; 11.8 kbps

Other Codecs • CDMA QCELP defined in IS-733 • Variable-rate coder • Two most common rates • The high rate, 13.3 kbps • A lower rate, 6.2 kbps • Silence suppression • For use with RTP, RFC 2658

GSM Enhanced Full-Rate (EFR) • GSM 06.60 • An enhanced version of GSM Full-Rate • ACELP-based codec • The same bit rate and the same overall packing structure • 12.2 kbps • Support discontinuous transmission • For use with RTP, RFC 1890

GSM Adaptive Multi-Rate (AMR) codec • 20 ms coding delay • Eight different modes • 4.75 kbps to 12.2 kbps • 12.2 kbps, GSM EFR • 7.4 kbps, IS-641 (TDMA cellular systems) • Change the mode at any time • Offer discontinuous transmission • The SID (Silence Descriptor) is sent in every 8th frame and is 5 bytes in size • The coding choice of many 3G wireless networks

The MOS values are for laboratory conditions • G.711 does not deal with lost packets • G.729 can accommodate a lost frame by interpolating from previous frames • But this causes errors in subsequent speech frames • Processing Power • G.728 or G.729, 40 MIPS • G.726, 10 MIPS

iLBC • A free codec for robust VoIP • 13.33 kbps with an encoding frame length of 30 ms, or 15.20 kbps with a frame length of 20 ms • Computational complexity in the range of G.729A

Speex • Open-source patent-free speech codec • CELP (code-excited linear prediction) codec • operating modes: • narrowband (8 kHz sampling rate) • 2.15 – 24.6 kb/s • delay of 30 ms • wideband (16 kHz sampling rate) • 4-44.2 kb/s • delay of 34 ms • ultra-wideband (32 kHz sampling rate) • intensity stereo encoding • variable bit rate (VBR) possible • voice activity detection (VAD)

Cascaded Codecs • E.g., G.711 stream -> G.729 encoder/decoder • Might not even come close to G.729 quality • Each coder only generates an approximation of the incoming signal • Audio samples • http://www.cs.columbia.edu/~hgs/audio/codecs.html

Effects of packetization
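
One effect of packetization is header overhead: every packet carries IP, UDP, and RTP headers regardless of how small the codec payload is. The sketch below assumes IPv4 without options (20 bytes), UDP (8 bytes), and a basic RTP header (12 bytes), ignores link-layer framing, and uses a 20 ms packet interval as an example value; none of these choices come from the slide.

```python
HEADER_BYTES = 20 + 8 + 12  # IPv4 + UDP + RTP, no options or extensions


def ip_bandwidth_bps(codec_bps, packet_ms):
    """IP-level bandwidth for a codec rate and packetization interval."""
    payload_bytes = codec_bps * packet_ms / 1000 / 8
    packets_per_second = 1000 / packet_ms
    return (payload_bytes + HEADER_BYTES) * 8 * packets_per_second


g711_bps = ip_bandwidth_bps(64000, 20)   # 160-byte payload per packet
g729_bps = ip_bandwidth_bps(8000, 20)    # 20-byte payload per packet
```

With these assumptions the headers add 16 kbps to either stream, which is a modest fraction for G.711 but twice the payload rate for G.729.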

Tones, Signals, and DTMF Digits • The hybrid codecs are optimized for human speech • Other data may need to be transmitted • Tones: fax tones, dial tone, busy tone • DTMF digits for two-stage dialing or voice-mail • G.711 is OK • G.723.1 and G.729 can make them unintelligible • The ingress gateway needs to intercept • The tones and DTMF digits • Use an external signaling system

Easy at the start of a call • Difficult in the middle of a call • Encode the tones differently from the speech • Send them along the same media path • An RTP packet provides the name of the tone and the duration • Or, a dynamic RTP profile; an RTP packet containing the frequency, volume and the duration • RFC 2198 • An RTP payload format for redundant audio data • Sending both types of RTP payload

RTP Payload Format for DTMF Digits • An Internet Draft • Both methods described before • A large number of tones and events • DTMF digits, a busy tone, a congestion tone, a ringing tone, etc. • The named events • E: the end of the tone, R: reserved

Payload format
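
A sketch of packing and parsing a named-event payload. The slides only name the E and R bits; the field widths used here (8-bit event, E bit, R bit, 6-bit volume, 16-bit duration) are taken from the layout later published as RFC 2833 and are an assumption as far as this deck is concerned. Function names are illustrative.

```python
import struct


def pack_event(event, end, volume, duration):
    """Build the 4-octet named-event payload.

    event: event code (0-9 are the DTMF digits 0-9 in RFC 2833)
    end: True on the packet that marks the end of the tone (E bit)
    volume: 0-63, duration: 0-65535 in timestamp units
    """
    if not (0 <= event <= 255 and 0 <= volume <= 63 and 0 <= duration <= 0xFFFF):
        raise ValueError("field out of range")
    flags = (0x80 if end else 0) | volume  # R bit (0x40) is left at zero
    return struct.pack("!BBH", event, flags, duration)


def unpack_event(payload):
    """Return (event, end, volume, duration) from the first four octets."""
    event, flags, duration = struct.unpack("!BBH", payload[:4])
    return event, bool(flags & 0x80), flags & 0x3F, duration
```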
