Speech & Audio Coding

Speech & Audio Coding TSBK01 Image Coding and Data Compression Lecture 11, 2003 Jörgen Ahlberg

Outline • Part I - Speech • Speech • History of speech synthesis & coding • Speech coding methods • Part II – Audio • Psychoacoustic models • MPEG-4 Audio

Speech Production • The human’s vocalapparatusconsists of: • lungs • trachea (wind pipe) • larynx • contains 2 folds of skin called vocal cords which blow apart and flap together as air is forced through • oral tract • nasal tract

The Speech Signal 1

The Speech Signal

The Speech Signal • Elements of the speech signal: • spectral resonance (formants, moving) • periodic excitation (voicing, pitched) + pitch contour • noise excitation (fricatives, unvoiced, no pitch) • transients (stop-release bursts) • amplitude modulation (nasals, approximants) • timing

The Speech Signal Vowels - characterised by formants; generally voiced; Tongue & lips - effect of rounding. Examples of vowels: a, e, i, o, u, a, ah, oh. Vibration of vocal cords: male 50 - 250Hz, female up to 500Hz. Vowels have in average much longer duration than consonants. Most of the acoustic energy of a speech signal is carried by vowels. F1-F2 chart Formant positions

VODER – the architecture History of Speech Coding • 1926 - PCM - first conceived by Paul M. Rainey and independently by • Alex Reeves (AT&T Paris) in 1937. Deployed in US PSTN in 1962 • 1939 - Channel vocoder - first analysis-by-synthesis system developed • by Homer Dudley of AT&T labs - VODER

History of Speech Coding • 1926 - PCM - first conceived by Paul M. Rainey and independently by • Alex Reeves (AT&T Paris) in 1937. Deployed in US PSTN in 1962 • 1939 - Channel vocoder - first analysis - by - synthesis system developed • by Homer Dudley of AT&T labs - VODER

OVE formant synthesis (Gunnar Fant,KTH), 1953

History of Speech - Coding • 1926 - PCM - first conceived by Paul M. Rainey and independently by • Alex Reeves (AT&T Paris) in 1937. Deployed in US PSTN in 1962 • 1939 - Channel vocoder - first analysis - by - synthesis system Homer • Dudley of AT&T labs - VODER • 1952 - delta modulation proposed, differential PCM invented • 1957 - -law encoding proposed (standardised for telephone network • in 1972 (G.711)) • 1974 - ADPCM developed • 1984 - CELP vocoder proposed (majority of coding standards for speech • signal today use a variation on CELP)

Source-filter Model of Speech Production • Signal from a source is filtered by a time-varying filter with resonant properties similar to that of the vocal tract. • The gain controls Av and AN determine the intensity of voiced and unvoiced excitation. • The frequency of higher formant are attenuated by -12 dB/octave (due to the nature of our speech organs). • This is an over simplified model for speech production. However, it is very often adequate for understanding the basic principles.

Speech Coding Strategies 1. PCM • Invented 1926, deployed 1962. • The speech signal is sampled at 8 kHz. • Uniform quantization requires >10 bits/sample. • Non-uniform quantization (G.711, 1972) • Quantizing y to 8 bits -> 64 kbit/s.

Speech Coding Strategies 2. Adaptive DPCM • Example: G.726 (1974) • Adaptive predictor based on six previous differences. • Gain-adaptive quantizer with 15 levels ) 32 kbit/s.

Speech Coding Strategies 3. Model-based Speech Coding • Advanced speech coders are based on models of how speech is produced: Excitationsource Vocaltract

An Excitation Source Noisegenerator Pulsegenerator Pitch

g1 g2 gn BP BP BP Vocal Tract Filter 1: A Fixed Filter Bank

Vocal Tract Filter 2: A Controllable Filter

Linear Predictive Coding (LPC) • The controllable filter is modelled as yn =  ai yn-i + Gnwhere n is the input signal and yn is the output. • We need to estimate the vocal tract parameters (ai and G) and the exciatation parameters (pitch, v/uv). • Typically the source signal is divided in short segments and the parameters are estimated for each segment. • Example: The speech signal is sampled at 8 kHz and divided in segments of 180 samples (22.5 ms/segment).

Typical Scheme of an LPC Coder Noisegenerator Vocal tractfilter Pulsegenerator Pitch Filter coeffs v/uv Gain

Estimating the Parameters • v/uv estimation • Based on energy and frequency spectrum. • Pitch-period estimation • Look for periodicity, either via the a.c.f our some other measure, for examplethat gives you a minimum value when p equals the pitch period. • Typical pitch-periods: 20 - 160 samples.

Estimating the Parameters • Vocal tract filter estimation • Find the filter coefficients that minimize the error2 = ( yn -  ai yn-i + Gn )2 • Compare to the computation of optimal predictors (Lecture 7).

Estimating the Parameters • Assuming a stationary signal:where R and p contain acf values. • This is called the autocorrelation method.

Estimating the Parameters • Alternatively, in case of a non-stationary signal:where • This is called the autocovariance method.

Example • Coding of parameters using LPC10 (1984):

The Vocal Tract Filter • Different representations: • LPC parameters • PARCOR (Partial Correlation Coefficients) • LSF (Line Spectrum Frequencies)

Code Excited Linear Prediction Coding (CELP) Encoding: • LPC analysis) V(z) • Define perceptual weighting filter. This permits more noise at formant • frequencies where it will be masked by the speech • Synthesise speech using each codebook entry in turn as the input to V(z) • Calculate optimum gain to minimise perceptually weighted error energy • in speech frame • Select codebook entry that gives lowest error • Performance: • 16kbit/s: MOS=4.2, • Delay=1.5 ms, 19 MIPS • 8 kbit/s: MOS=4.1, • Delay=35 ms, 25 MIPS • 2.4kbit/s: MOS=3.3, • Delay=45 ms, 20 MIPS • Transmit LPC parameters and codebook index • Decoding: • Receive LPC parameters and codebook index • Re-synthesise speech using V(z) and codebook entry

Examples • G.728 • V(z) is chosen as a large FIR-filter (M ¼ 50). • The gain and FIR-parametrers are estimated recursively from previously received samples. • The code book contains 127 sequences. • GSM • The code book contains regular pulse trains with variabel frequency and amplitudes. • MELP • Mixed excitation linear prediction • The code book is combined with a noise generator.

Other Variations • SELP – Self Excited Linear Prediction • MPLP – Multi-Pulse Excited Linear Prediction • MBE – Multi-Band Excitation Coding

Quality Levels

Subjective Assessment • In digital communications speech quality is classified into four general • categories, namely: broadcast, network or toll, communications, and synthetic. • Broadcast wideband speech – high quality ”commentary” speech – generally • achieved at rates above 64 kbits/s. • MOS (Mean Opinion Score): result of averaging opinions scores for a set of • between 20 – 60 untrained subjects. • They rate the quality 1 to 5 (1-bad, 2-poor, 3-fair, 4-good, 5-excellent). • MOS of 4 or higher defines good or tool quality (network quality) • - reconstructed signal generally indistinguishable from the original. • MOS between 3.5 – 4.0 defines communication quality – telephone • communications • MOS between 2.5 – 3.5 implies synthetic quality

Subjective Assessment • DRT (Diagnostic Rhyme Test): listeners should recognise one of the two • possible words in a set of rhyming pairs (e.g. meatl/heat) • DAM (Diagnostic Acceptability Measure) - trained listeners • judge various factors e.g. muffledness, buzziness, intelligibility Quality versus data rate (8kHz sampling rate)

Speech & Audio Coding