A Tutorial on MPEG/Audio Compression

Author: Davis Pan
Presenter: Klara Nahrstedt
Source: “Readings in Multimedia Computing and Networking”, Kevin Jeffay, Hong-Jiang Zhang, Morgan Kaufmann, 2002

Outline
• Features and Applications
• MPEG Audio Encoding Steps
• Time to Frequency Mapping
• Psychoacoustic Model
• Bit Allocation
• Layer Coding Options
• Conclusions

Features and Applications
• Generic lossy audio compression
• Diverse assortment of compression modes
• Supports random access, audio fast-forward, and audio reverse
• Sampling rates: 32, 44.1, or 48 kHz
• Audio channel support:
  • Monophonic mode for a single audio channel
  • Dual-monophonic mode for two independent channels
  • Stereo mode for stereo channels
  • Joint-stereo mode, which takes advantage of correlations between the stereo channels
• Predefined bit rates: 32 to 224 kbps
• Compression layers:
  • Layer I: simplest, suited to bit rates above 192 kbps
  • Layer II: intermediate complexity, targets about 128 kbps
  • Layer III: most complex, but offers the best audio quality at about 64 kbps

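MPEG Audio Encoding Steps
[Figure: encoder block diagram, summarized below]
• Input: uncompressed PCM audio signal, 32 samples at a time in the time domain
• Filter bank: transformation from the time domain to the frequency domain (32 sub-bands)
• Psychoacoustic model: analyzes the input in parallel and guides quantization
• Quantization / bit allocation:
  • If the noise level is high -> rough quantization
  • If the noise level is low -> finer quantization
• Entropy encoder: Huffman coding
• Multiplexer: combines the coded audio with ancillary data (optional)
• Output: MPEG compressed data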

MPEG/Audio Filter Bank
• The filter bank divides the PCM audio input into 32 equal-width frequency subbands
• The output sample for subband i at time t is

$$S_t[i] = \sum_{k=0}^{63} \sum_{j=0}^{7} M[i][k] \, \bigl( C[k + 64j] \cdot x[k + 64j] \bigr)$$

  where
  • S_t[i] is the filter output sample for subband i, i ∈ [0, 31]
  • C[n] is one of the 512 coefficients of the analysis window defined in the standard
  • x[n] is an audio input sample read from a 512-sample buffer
  • M[i][k] are the analysis matrix coefficients
• Comments:
  • The equal widths of the subbands do not reflect the human auditory system
  • The filter bank and its inverse are not lossless transformations (however, the error introduced by the filter bank is small and inaudible)
  • Adjacent filter bands have significant frequency overlap
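The double sum in the filter bank formula can be read as "window the 512-sample buffer, fold it down to 64 values, then apply a 32 x 64 matrix". The sketch below follows that reading. Two things in it are assumptions that the slide does not supply: the cosine form used for M[i][k] (the form commonly given for this filter bank), and `placeholder_window`, a Hann-shaped stand-in for the 512 coefficients C[n] that the standard defines in a table. It is meant to show the data flow, not to reproduce a conforming encoder.

```python
import math

NUM_SUBBANDS = 32
BUFFER_LEN = 512


def analysis_matrix():
    # Assumed form: M[i][k] = cos((2i + 1)(k - 16) * pi / 64),
    # for i in 0..31 and k in 0..63. The slide does not define M.
    return [
        [math.cos((2 * i + 1) * (k - 16) * math.pi / 64) for k in range(64)]
        for i in range(NUM_SUBBANDS)
    ]


def placeholder_window():
    # Hypothetical stand-in for the standard's analysis window C[n].
    # The real coefficients come from a table in the standard.
    return [
        0.5 - 0.5 * math.cos(2 * math.pi * (n + 0.5) / BUFFER_LEN)
        for n in range(BUFFER_LEN)
    ]


def subband_samples(x, C, M):
    """Return one output sample per subband for a 512-sample buffer x."""
    if len(x) != BUFFER_LEN or len(C) != BUFFER_LEN:
        raise ValueError("x and C must each hold 512 values")
    # Inner sum over j: window the buffer and fold 512 values down to 64.
    folded = [
        sum(C[k + 64 * j] * x[k + 64 * j] for j in range(8))
        for k in range(64)
    ]
    # Outer sum over k: apply the analysis matrix to get 32 subband samples.
    return [
        sum(M[i][k] * folded[k] for k in range(64))
        for i in range(NUM_SUBBANDS)
    ]


# Usage: a 1 kHz tone sampled at 48 kHz, one buffer's worth.
M = analysis_matrix()
C = placeholder_window()
x = [math.sin(2 * math.pi * 1000 * n / 48000) for n in range(BUFFER_LEN)]
S = subband_samples(x, C, M)
```

Each call produces one sample for each of the 32 subbands; an encoder would shift 32 new input samples into the buffer and call it again.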

MPEG/Audio Psychoacoustics
• Perceptual sensitivity, called loudness, is not linear across all frequencies and intensities.
• Some parts of an acoustic event can be measured but not heard, because one part of a sound mixture masks another part. This masking effect can be observed in both the time and frequency domains.
• MPEG/Audio compresses audio by removing acoustically irrelevant parts of the audio signal.
• MPEG/Audio takes advantage of the human auditory system’s inability to hear quantization noise under auditory masking.
• Auditory masking is a perceptual property of the human auditory system: the presence of a strong audio signal makes weaker audio signals in its temporal or spectral neighborhood imperceptible.

MPEG/Audio Psychoacoustic Model
• The psychoacoustic model analyzes the audio signal and computes the amount of noise masking
• Masking depends on signal frequency and loudness
• The width of the masking curves as a function of frequency is a good indicator of the human auditory system’s frequency-dependent behavior
• This width is called the size of the critical band
• The psychoacoustic model uses a separate, independent time-to-frequency mapping because it needs finer frequency resolution for an accurate calculation of masking thresholds
[Figure: masking by a strong tonal signal. Axes: amplitude (dB) versus frequency (critical band rate); marked values include 250 Hz, 500 Hz, 1 kHz, and 60 dB; the region where weaker signals are masked is shown around the strong tonal signal.]
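To make the "critical band rate" axis of the masking figure concrete, the sketch below maps a frequency in Hz to an approximate critical band rate in Bark. The formula is the widely used Zwicker and Terhardt closed-form approximation; it is an assumption brought in for illustration and is not taken from the slides, nor is it claimed to be how the standard's psychoacoustic models compute band positions.

```python
import math


def critical_band_rate(freq_hz):
    """Approximate critical band rate (Bark) for a frequency in Hz.

    Assumed formula (Zwicker and Terhardt approximation), not from the slides.
    """
    return (
        13.0 * math.atan(0.00076 * freq_hz)
        + 3.5 * math.atan((freq_hz / 7500.0) ** 2)
    )


# The three frequencies marked on the figure's axis.
for f in (250.0, 500.0, 1000.0):
    print(f, round(critical_band_rate(f), 2))
```

On this scale, equal steps correspond roughly to equal perceptual widths, which is why masking curves look similar in shape across frequency when plotted against it.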

Quantization
• Each sub-band is quantized according to the audibility of quantization noise within that band. The goal is to make the quantization noise inaudible.
• The model calculates noise-masking thresholds for each sub-band.
• The model computes the signal-to-mask ratio (SMR) as the ratio of the signal energy within the sub-band to the minimum masking threshold for that sub-band.
• The SMR determines how finely the sub-band is quantized.
• The model passes this value to the bit allocation component.
[Figure: original signal energy (dB) with the computed masking threshold and the resulting SMR; after quantization, the coded audio energy (dB) obtained using the SMRs.]
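The SMR definition above translates directly into a few lines of code. In this sketch the function name `smr_db`, the use of linear energy units for the inputs, and the example numbers are all illustrative assumptions; only the definition (signal energy over the minimum masking threshold in the sub-band, expressed in dB) comes from the slide.

```python
import math


def smr_db(signal_energy, masking_thresholds):
    """Signal-to-mask ratio in dB for one sub-band.

    signal_energy: signal energy in the sub-band (linear units).
    masking_thresholds: masking threshold values (linear units) computed at
        the frequencies that fall inside this sub-band.
    """
    min_threshold = min(masking_thresholds)
    # Both quantities are energies, so the dB conversion uses 10 * log10.
    return 10.0 * math.log10(signal_energy / min_threshold)


# Hypothetical sub-band: energy 100, thresholds between 1 and 4.
example = smr_db(100.0, [4.0, 1.0, 2.5])
```

A large positive SMR means the signal stands well above the mask, so the sub-band needs fine quantization; a negative SMR means the signal itself is below the mask.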

Bit Allocation
• Determines the number of code bits allocated to each subband, based on information from the psychoacoustic model
• Algorithm:
  1. Compute the mask-to-noise ratio (MNR): MNR_dB = SNR_dB - SMR_dB
     • The MPEG/Audio standard provides tables that give estimates of the SNR resulting from quantizing to a given number of quantizer levels
  2. Get the MNR for each sub-band
  3. Search for the sub-band with the lowest MNR
  4. Allocate code bits to this sub-band. Whenever a sub-band is allocated more code bits, look up the new SNR estimate and recompute its MNR, i.e., repeat from step 1 until no more code bits can be allocated.
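The allocation loop described above is a greedy search, which the following sketch illustrates. It makes several simplifying assumptions that are not in the slides: the SNR for a given number of bits is estimated as 6.02 dB per bit rather than looked up in the standard's tables, each extra bit for a sub-band is charged 12 code bits (one per sample in a Layer 1 group), and scale-factor and header overhead are ignored. The names `allocate_bits`, `bit_pool`, and `cost_per_step` are hypothetical.

```python
def allocate_bits(smr_db, bit_pool, max_bits=15, cost_per_step=12,
                  snr_per_bit_db=6.02):
    """Greedy bit allocation sketch.

    smr_db: list of signal-to-mask ratios (dB), one per sub-band.
    bit_pool: number of code bits available for samples.
    Returns the number of bits per sample assigned to each sub-band.
    """
    bits = [0] * len(smr_db)

    def mnr(band):
        # MNR = SNR - SMR; SNR is an assumed 6.02 dB per allocated bit.
        return snr_per_bit_db * bits[band] - smr_db[band]

    while bit_pool >= cost_per_step:
        candidates = [b for b in range(len(bits)) if bits[b] < max_bits]
        if not candidates:
            break
        # Step 3: find the sub-band with the lowest MNR.
        worst = min(candidates, key=mnr)
        # Step 4: give it one more bit per sample; its MNR is recomputed
        # automatically on the next pass through the loop.
        bits[worst] += 1
        bit_pool -= cost_per_step
    return bits


# Three hypothetical sub-bands and enough bits for four allocation steps.
allocation = allocate_bits([20.0, 5.0, -10.0], bit_pool=48)
```

In this example the first sub-band, which has the highest SMR, receives most of the bits, and the third, whose signal is already below the mask, receives none.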

Layer Coding Options
• Layer 1 codes audio in frames of 384 samples
  • Groups 12 samples from each of the 32 subbands
• Layer 2 codes audio in frames of 1,152 samples
  • Groups three groups of 12 samples from each subband
• Layer 3 is derived from ASPEC (audio spectral perceptual entropy coding) and OCF (optimal coding in the frequency domain)
  • Based on the same filter bank as Layers 1 and 2, but compensates for filter bank deficiencies
• Entropy coding is used only in Layer 3
[Figure: grouping of subband samples into frames, one group of 12 samples per subband for Layer 1 and three groups of 12 for Layers 2 and 3.]
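The frame sizes follow from the grouping: 32 subbands times 12 samples times the number of groups. The short sketch below works that out and converts it to a frame duration for a given sampling rate. The function names are illustrative, and the group count for Layer 3 is taken from the figure label that pairs it with Layer 2.

```python
SUBBANDS = 32
SAMPLES_PER_GROUP = 12
GROUPS_PER_FRAME = {1: 1, 2: 3, 3: 3}  # layer -> groups of 12 per subband


def frame_samples(layer):
    """Number of PCM samples covered by one frame of the given layer."""
    return SUBBANDS * SAMPLES_PER_GROUP * GROUPS_PER_FRAME[layer]


def frame_duration_ms(layer, sample_rate_hz):
    """Duration of one frame in milliseconds at the given sampling rate."""
    return 1000.0 * frame_samples(layer) / sample_rate_hz


for layer in (1, 2, 3):
    print(layer, frame_samples(layer), frame_duration_ms(layer, 48000))
```

At 48 kHz this gives 8 ms per Layer 1 frame and 24 ms per Layer 2 or Layer 3 frame.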

Layer 1 Example
• Each group of 12 samples gets a bit allocation and, if the bit allocation is not zero, a scale factor
• Layer 1 uses a bit allocation of 0 to 15 bits per subband
• The scale factor is a multiplier that sizes the samples to make full use of the range of the quantizer
• The scale factor has a 6-bit representation
• The decoder multiplies the decoded quantizer output by the scale factor to recover the quantized subband value
• The dynamic range of the scale factors alone exceeds 120 dB
• The combination of bit allocation and scale factor provides the potential for representing samples with a dynamic range well over 120 dB
[Figure: Layer 1 frame layout, field sizes in bits: Header (32) | CRC (0, 16) | Bit Allocation (128-256) | Scale Factors (0-384) | Samples | Ancillary Data]
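The role of the scale factor can be illustrated with a small round trip: normalize a group of 12 samples by a scale factor, quantize uniformly, then multiply back in the decoder. This is a simplified sketch under stated assumptions: the scale factor here is simply the group's peak magnitude rather than an entry from the standard's 6-bit scale factor table, the quantizer is a plain symmetric uniform one, and allocations below 2 bits are treated as "nothing transmitted". The names `encode_group` and `decode_group` are hypothetical.

```python
def encode_group(samples, bits):
    """Scale and quantize one group of subband samples.

    Returns (scale, codes). With fewer than 2 bits nothing is coded.
    """
    scale = max(abs(s) for s in samples)
    if bits < 2 or scale == 0.0:
        return 0.0, [0] * len(samples)
    half = (1 << (bits - 1)) - 1  # e.g. bits=4 gives codes in -7..7
    # Dividing by the scale factor makes the group fill the quantizer range.
    codes = [round(s / scale * half) for s in samples]
    return scale, codes


def decode_group(scale, codes, bits):
    """Recover quantized subband values by multiplying by the scale factor."""
    if bits < 2 or scale == 0.0:
        return [0.0] * len(codes)
    half = (1 << (bits - 1)) - 1
    return [scale * c / half for c in codes]


# A hypothetical group of 12 low-level subband samples.
group = [0.003 * v for v in (1, -2, 3, -4, 5, -6, 7, -8, 9, -10, 11, -12)]
scale, codes = encode_group(group, bits=8)
restored = decode_group(scale, codes, bits=8)
```

Because the samples are normalized before quantization, a quiet group uses the same quantizer steps as a loud one, so the error scales with the group's level instead of being fixed.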

Final MPEG/Audio Comments
• The real reason for a precision of 16 bits per sample is to get a good signal-to-noise ratio
• The noise we get is quantization noise from the digitization process
• For each bit added in the bit allocation process, we get about 6 dB better SNR
• The masking effect means that we can raise the noise floor around a strong sound, because the noise will be masked anyway
• Raising the noise floor is the same as using fewer bits, and using fewer bits is the same as better compression
• The masking effect also occurs before and after a strong sound (pre- and postmasking):
  • Premasking: 2-5 ms
  • Postmasking: 100 ms
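The "6 dB per bit" rule comes from the fact that each extra bit doubles the number of quantizer levels, and a factor of two in amplitude is 20·log10(2), about 6.02 dB. The sketch below uses that rule of thumb, ignoring the small constant offset that depends on the signal type, to estimate the SNR of a given word length and the number of bits saved when masking lets the noise floor rise. The function names are illustrative assumptions.

```python
import math

DB_PER_BIT = 20.0 * math.log10(2.0)  # about 6.02 dB


def approx_snr_db(bits):
    """Rule-of-thumb SNR for uniform quantization with the given word length."""
    return DB_PER_BIT * bits


def bits_saved(noise_floor_raise_db):
    """Whole bits that can be dropped if the noise floor may rise this much."""
    return int(noise_floor_raise_db // DB_PER_BIT)


print(round(approx_snr_db(16), 1))  # SNR estimate for 16-bit PCM
print(bits_saved(30.0))             # bits saved if masking hides 30 dB of noise
```

Under this estimate, 16-bit PCM gives roughly 96 dB, and every 6 dB of noise that masking hides removes one bit per sample from the affected band.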
