Media Processing – Audio Part Dr Wenwu Wang Centre for Vision Speech and Signal Processing Department of Electronic Engineering firstname.lastname@example.org http://personal.ee.surrey.ac.uk/Personal/W.Wang/teaching.html
Tentative outline • Week 6: Fundamentals of audio • Week 7: Audio acquiring, recording, and standards • Week 8: Audio processing, coding, and standards • Week 9: Audio production and reproduction • Week 10: Audio perception and audio quality assessment
Fundamentals of audio Concepts and topics to be covered: • What is sound • Physical characteristics of audio • Amplitude, frequency, envelope, phase, etc • Pure tone versus complex sounds • Sound pressure level • Haas effect, inverse square law • Free and reverberant sound field • Audio signal chain • Early recording systems (mechanical, electrical, and more) • Modern recording chain (stereo, multitrack) • Broadcast distribution
What is sound • Sound is produced by a vibrating source around which the carrying media (e.g. air) is caused to move. From the vibrating source, a sound wave radiates omnidirectionally away from itself, and the sound energy is transferred to the carrying media through compressions and rarefactions (similar to waves moving on the surface of the sea). Source: Francis Rumsey and Tim McCormick (1994)
Displaying a sound wave • Time domain method: such as waveform, using oscilloscope. • Frequency domain method: such as spectrum, using spectrum analyser. Source: Francis Rumsey and Tim McCormick (1994) Demos spectrum analyser: http://www.youtube.com/watch?v=LIOUXr9v2RI http://www.youtube.com/watch?v=YS6jaqeXVok
Basic characteristics of sound wave • Frequency: the rate at which the vibrating source oscillates, quoted in hertz (Hz) or cycles per second (cps). The human ear can perceive sounds with frequencies between approximately 20 Hz and 20 kHz (known as the audio frequency range). • Amplitude: the amount of compression and rarefaction of the carrying medium resulting from the motion of the vibrating source, related to the loudness of the sound as perceived by human ears. • Wavelength: the distance between two adjacent peaks of compression or rarefaction as the wave travels through the carrying medium, often represented by the Greek letter lambda. • Velocity: the speed of the sound energy transfer. The velocity of sound in air is about 344 meters per second; it depends on the carrying medium and its density. Wavelength = Velocity/Frequency • Envelope: the shape of the sound wave's evolution over time. It includes four main parts: the attack, the initial decay, the sustain (i.e. internal dynamic), and the final decay (i.e. release). • Phase: the time course of a signal relative to a reference, as it arrives at a receiver (i.e. the ear).
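The relation Wavelength = Velocity/Frequency can be sketched numerically. A minimal illustration (not part of the original slides), using the 344 m/s figure for air quoted above:

```python
SPEED_OF_SOUND = 344.0  # approximate velocity of sound in air, m/s

def wavelength(frequency_hz: float) -> float:
    """Wavelength (in meters) of a sound wave in air: velocity / frequency."""
    return SPEED_OF_SOUND / frequency_hz

# Across the audio frequency range, wavelengths vary enormously:
low_end = wavelength(20.0)       # 17.2 m at the bottom of the audible range
high_end = wavelength(20000.0)   # 0.0172 m (about 1.7 cm) at the top
```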
A simple sound – pure tone (sine wave) Source: Francis Rumsey and Tim McCormick (1994) A demo for sound waves: http://www.youtube.com/watch?v=dbeK1fg1Rew&list=PLC9626B413EC82543
Harmonic (repetitive) sounds • Unlike the pure tone (a single frequency, which is not commonly heard in real life), real music audio is much more complex and is a mixture of harmonic sounds (multiple frequencies). • Harmonic frequencies are integer multiples of the fundamental frequency. For example, the first harmonic is equal to the fundamental frequency, the second harmonic (also known as the first overtone, or partial) is double the fundamental frequency, the third harmonic (i.e. the second overtone) is three times the fundamental frequency, etc. Harmonics Source: Alan P. Kefauver and David Patschke (2007)
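The harmonic series described above is simply the set of integer multiples of the fundamental; a one-line sketch (illustrative, not from the slides):

```python
def harmonics(fundamental_hz: float, n: int) -> list:
    """The first n harmonic frequencies: integer multiples of the
    fundamental. The first harmonic equals the fundamental itself."""
    return [k * fundamental_hz for k in range(1, n + 1)]

# For a 100 Hz fundamental, the second harmonic (first overtone) is 200 Hz:
series = harmonics(100.0, 4)  # [100.0, 200.0, 300.0, 400.0]
```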
Harmonic (repetitive) sounds Modes of vibration of a stretched string: (a) fundamental, (b) second harmonic, (c) third harmonic. Source: Francis Rumsey and Tim McCormick (1994)
Harmonic (repetitive) sounds • Harmonics exist because most vibrating sources are able to vibrate in a number of harmonic modes simultaneously. • As shown in the figure on the previous slide, a stretched string may be made to vibrate in any of a number of modes, corresponding to integer multiples of the fundamental frequency of vibration of the string. • The fundamental corresponds to the mode in which the string moves up and down as a whole, while the harmonics correspond to modes in which the vibration pattern is divided into points of maximum and minimum motion along the string (called antinodes and nodes). • The overtones in a sound spectrum may not be exact integer multiples of the fundamental; in this case they are more correctly referred to as inharmonic partials. This tends to happen for vibrating sources with a complicated shape, such as a bell or a percussion instrument.
Harmonic (repetitive) sounds Spectrum representation of a selection of some simple waveforms. (a) The sine wave contains only a single frequency. (b) The sawtooth wave consists of components at the fundamental and its integer multiples, with amplitudes gradually decreasing. (c) The square wave consists of frequencies at odd multiples of the fundamental. Source: Francis Rumsey and Tim McCormick (1994)
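The square wave's odd-harmonic structure described in (c) can be checked by summing odd harmonics with amplitudes falling as 1/k, as in its Fourier series (an illustrative sketch; the slides do not give this code):

```python
import math

def square_wave_partial(t: float, f0: float, n_terms: int) -> float:
    """Approximate a square wave by summing odd harmonics of f0,
    each with amplitude 1/k, following the Fourier series of a
    unit square wave."""
    total = 0.0
    for i in range(n_terms):
        k = 2 * i + 1  # odd multiples of the fundamental only: 1, 3, 5, ...
        total += math.sin(2.0 * math.pi * k * f0 * t) / k
    return (4.0 / math.pi) * total

# In the middle of the positive half-cycle the sum converges towards +1:
value = square_wave_partial(0.25, 1.0, 500)  # close to 1.0
```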
Sound complexity • A sound like a pure sine wave is not heard frequently; a person whistling or a wind instrument can produce sounds similar to a sinusoidal wave. • Real sounds are usually made up of a combination of more complex waveforms. The more complex the waveform, the more noise-like the sound becomes. • When the waveform has a highly random pattern, the sound becomes noise (as shown on the next page).
Non-repetitive sound Waveforms and frequency spectra of non-repetitive waveforms: (a) pulse, (b) noise. Source: Francis Rumsey and Tim McCormick (1994)
Non-repetitive sound • Non-repetitive sound waves do not have a recognisable frequency as harmonic sounds do. • The frequency spectrum of a non-repetitive sound consists of a collection of unrelated frequency components. • As shown in the examples, short pulses have continuous frequency spectra extending over a wide frequency range. The shorter the pulse, the wider its frequency spectrum but usually the lower its total energy. • A completely random waveform is known as white noise, in which the frequency, amplitude and phase of the components are equally probable and constantly varying. Its spectrum is flat, with equal energy in any given bandwidth.
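A rough numerical sketch of the "completely random waveform" described above (illustrative only; independently drawn uniform samples approximate white noise):

```python
import random

def white_noise(n: int, seed: int = 0) -> list:
    """n samples of approximately white noise: each sample is drawn
    independently, so successive samples are uncorrelated and the
    long-run average spectrum is flat."""
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(n)]

noise = white_noise(10000)
mean = sum(noise) / len(noise)  # close to 0 over a long run
```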
The envelope of an audio signal Source: Alan P. Kefauver and David Patschke (2007)
The envelope of an audio signal • The attack is the time taken for the sound generator (e.g. a musical instrument) to respond (i.e. vibrate) to a strike, which depends on the materials the instrument is made from. A softly blown flute has a longer attack time than a sharply struck snare drum; struck instruments have an attack time (in the 1 to 20 ms range) that is much shorter than that of wind instruments (in the 60 to 200 ms range). • The initial decay is caused by the cessation of the striking force that set the instrument vibrating. • The sustain refers to the levelling-off period of the sound, when the sound energy becomes stable. • The final decay occurs when the sound is no longer driven by the player and dies away through the resonance of the vibrating medium. It varies from as short as 250 ms to as long as 100 s, depending on the medium. In addition, different frequencies decay at different rates, causing a change in sound timbre.
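The four envelope stages can be sketched as a piecewise-linear function of time (the segment lengths and sustain level below are illustrative placeholders, not values from the slides):

```python
def adsr_envelope(t_ms: float, attack: float = 20.0, decay: float = 60.0,
                  sustain_level: float = 0.7, sustain_end: float = 500.0,
                  release: float = 250.0) -> float:
    """Piecewise-linear attack / initial-decay / sustain / final-decay
    envelope. All times are in milliseconds; values are illustrative."""
    if t_ms < 0.0:
        return 0.0
    if t_ms < attack:                       # attack: rise to full level
        return t_ms / attack
    if t_ms < attack + decay:               # initial decay: fall to sustain
        frac = (t_ms - attack) / decay
        return 1.0 - frac * (1.0 - sustain_level)
    if t_ms < sustain_end:                  # sustain: level holds steady
        return sustain_level
    if t_ms < sustain_end + release:        # final decay (release)
        frac = (t_ms - sustain_end) / release
        return sustain_level * (1.0 - frac)
    return 0.0
```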
What is phase • The height of the spot varies sinusoidally with the angle of rotation of the wheel, i.e. it rises and falls regularly as the wheel rotates at a constant speed. The phase angle of a sine wave can be understood in terms of the number of degrees of rotation of the wheel. Source: Francis Rumsey and Tim McCormick (1994)
Phase difference versus angles • For the phase relationship between two waves of the same frequency, if each cycle is considered as corresponding to 360 degrees, then the phase difference between the two waves can be determined by comparing the 0 degree point on one wave with the 0 degree point on the other. In the above example, the top signal is 90 degrees out of phase with the lower signal. Source: Francis Rumsey and Tim McCormick (1994)
Phase difference versus time delay Source: Francis Rumsey and Tim McCormick (1994) • If two signals (of the same frequency) start out simultaneously from sources equidistant from a listener, they arrive at the listener in phase. If one source is more distant than the other, its signal will be delayed, and the phase relationship between the two is dependent on the amount of delay. • Sound travels about 34 cm per millisecond, so if speaker 2 is 1 meter more distant than speaker 1, its signal would be delayed by about 3 ms. The resulting phase also depends on the frequency of the sound: for a 330 Hz sound, a 3 ms delay corresponds to roughly one wavelength, so at this frequency the delayed signal would be back in phase with the first.
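The arithmetic in the loudspeaker example can be reproduced directly (a sketch using the 344 m/s velocity from earlier; the exact results differ slightly from the rounded figures above):

```python
SPEED_OF_SOUND = 344.0  # m/s, as quoted earlier in these notes

def delay_ms(extra_distance_m: float) -> float:
    """Delay in milliseconds introduced by an extra path length."""
    return extra_distance_m / SPEED_OF_SOUND * 1000.0

def phase_shift_deg(delay_milliseconds: float, frequency_hz: float) -> float:
    """Phase shift in degrees produced by a delay at a given frequency."""
    return 360.0 * frequency_hz * delay_milliseconds / 1000.0

d = delay_ms(1.0)                # about 2.9 ms for 1 m of extra path
phi = phase_shift_deg(d, 330.0)  # just under one full cycle (~345 deg) at 330 Hz
```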
Some facts about phase • Phase differences between signals can be caused by the time delays between them. Phase is usually measured as a number of degrees relative to some reference. • Phase is a relevant concept for continuous repetitive waveforms, and has little meaning for impulsive or transient sounds, where time difference is the more relevant quantity. • For a given time delay between two signals, the higher the frequency, the greater the phase difference. • The phase difference between two signals can be greater than 360 degrees if the delay of the second signal is long enough to exceed one cycle. Demo for sound and its characteristics: http://www.youtube.com/watch?v=cK2-6cgqgYA
In phase & out of phase In phase: the compression (positive) and rarefaction (negative) half-cycles of two waves of the same frequency coincide exactly in time and space. If the two signals are added together, they produce another signal of the same frequency but twice the amplitude. Out of phase: the positive half-cycle of one signal coincides with the negative half-cycle of the other. If these two are added together, they cancel each other. Partially out of phase: when the two are added together, the phase and amplitude of the result are the point-by-point sum of the two, resulting in partial addition or cancellation, and a phase somewhere between those of the original two. Source: Francis Rumsey and Tim McCormick (1994)
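The in-phase and out-of-phase cases can be checked numerically (a small sketch, not from the slides):

```python
import math

def sine(t: float, f: float, phase_deg: float = 0.0) -> float:
    """Unit-amplitude sine wave sample at time t (seconds)."""
    return math.sin(2.0 * math.pi * f * t + math.radians(phase_deg))

f, t = 100.0, 0.0025  # sample at the peak of a 100 Hz tone
in_phase = sine(t, f) + sine(t, f, 0.0)        # doubles the amplitude
out_of_phase = sine(t, f) + sine(t, f, 180.0)  # complete cancellation
```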
Decibel • The decibel is a widely used unit in audio engineering for measuring the ratio of one signal's amplitude to another's. It is usually used in preference to volts, watts and other absolute units, as it approximates more closely one's subjective impression of changes in the amplitude of a signal. • It helps to compress the range of values between the maximum and minimum sound levels encountered in real signals. For example, the human ear can perceive sound intensities ranging from 0.000 000 000 001 W per square meter to around 100 W per square meter; described in decibels, this range runs from 0 dB to 140 dB. • It is also used to describe the voltage gain of a device. For example, a microphone amplifier with a gain of 60 dB multiplies the input voltage by a factor of 1000, since 20 log10(1000/1) = 60 dB. • Decibels apply both to acoustical sound pressure (analogous to electrical voltage), i.e. dB = 20log10(V1/V2), and to sound power (analogous to electrical power), i.e. dB = 10log10(P1/P2). • If a signal level is quoted in decibels, a reference must be given. Therefore, 0 dB means the signal concerned is at the same level as the reference, not "no signal". The reference level for sound pressure levels (SPL) is defined worldwide as 20 μPa.
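The two decibel formulas can be written out directly, reproducing the slide's examples (an illustrative sketch):

```python
import math

def db_voltage(v1: float, v2: float) -> float:
    """Decibels for a voltage (or sound pressure) ratio: 20*log10(V1/V2)."""
    return 20.0 * math.log10(v1 / v2)

def db_power(p1: float, p2: float) -> float:
    """Decibels for a power (or intensity) ratio: 10*log10(P1/P2)."""
    return 10.0 * math.log10(p1 / p2)

gain = db_voltage(1000.0, 1.0)         # 60 dB: a x1000 voltage gain
hearing_span = db_power(100.0, 1e-12)  # 140 dB: pain vs. hearing threshold
```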
Sound in electrical form • In order to perform operations (such as amplification, recording, and mixing) on sound, it is often converted from an acoustical form into an electrical form. This is achieved by a microphone. • The equivalence between the acoustical and electrical signals: • Voltage of the electrical signal at the output of the microphone <-> acoustic compression and rarefaction of the air • Current flowing down the wire from the microphone <-> acoustic wave carried in the motion of the air particles Source: Francis Rumsey and Tim McCormick (1994)
Sound power and sound pressure • A sound source generates a certain amount of power, measured in watts. Sound pressure is the effect of that sound power on its surroundings (like the relation between the heat energy generated by a radiator and the temperature in the room). • Sound pressure level (SPL) is measured in newtons per square meter (pascals), and is often conveniently quoted in decibels. SPL = 0 dB is approximately equivalent to the threshold of hearing (the quietest sound perceivable by an average person) at a frequency of 1 kHz. • The amount of acoustical power generated by real sound sources is surprisingly small compared with the number of watts of electrical power involved in lighting a light bulb. Most everyday sources generate fractions of a watt of sound power. An acoustical source radiating 20 watts would produce a sound pressure level close to the threshold of pain for a listener close to the source.
Haas effect & Inverse square law • Sound power is spread over an increasingly large area as the wave travels away from the source. • The law of decreasing power per unit area (intensity) of a wavefront with increasing distance from the source is known as the inverse square law, as the intensity of sound drops in proportion to the inverse square of the distance from the source. In practice, the level drops about 6 dB for every doubling of distance from the source in free space (i.e. with no nearby reflecting surfaces). • The Haas effect, also known as the precedence effect, states that the sound arriving first at the two ears appears louder and therefore closer. Haas found that when the delay was greater than 5 ms but less than 30 ms, the delayed sound had to be 10 dB louder than the non-delayed sound for the two to be perceived as equally loud.
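The 6 dB-per-doubling figure follows directly from the inverse square law (a small check, not from the slides):

```python
import math

def level_drop_db(d1: float, d2: float) -> float:
    """Drop in sound level (dB) when moving from distance d1 to d2 in
    free space: intensity falls as 1/d^2, so the drop is
    10*log10((d2/d1)^2)."""
    return 10.0 * math.log10((d2 / d1) ** 2)

drop = level_drop_db(1.0, 2.0)  # about 6.02 dB per doubling of distance
```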
Haas effect & Inverse square law Source: Francis Rumsey and Tim McCormick (1994)
Measuring sound pressure level • An SPL meter, usually used to measure the level of sound at a particular point, is a device consisting of a microphone, amplifier, filters, and a meter, as shown in the following figure. The weighting filter attenuates different frequency bands according to the sensitivity of human hearing. Source: Francis Rumsey and Tim McCormick (1994)
A professional sound level meter (B&K Corp.) Source: Alan P. Kefauver and David Patschke (2007)
Typical sound pressure levels in dBs Source: Alan P. Kefauver and David Patschke (2007)
Free and reverberant fields • As the distance from a source increases, the direct sound level drops but the reverberant sound level remains roughly constant. The resultant sound level experienced at different distances from the source depends on the reverberation time of the room, as in a reverberant room the level of reflected sound is higher than in a 'dead' or 'dry' room (one with little reverberation). Source: Francis Rumsey and Tim McCormick (1994)
Free and reverberant fields • Direct sound: sound that arrives at the listener straight from the source, without reflections. • Early reflections: the reflected sounds (echoes) from nearby surfaces in a room, which arrive at the listener within the first few milliseconds (up to about 50 ms) after the direct sound. • Late reflections: the echoes that reach the listener's ears more than about 50 ms after the direct sound. Source: Francis Rumsey and Tim McCormick (1994)
Early mechanical recording machines Source: Francis Rumsey and Tim McCormick (1994)
Early mechanical recording machines (cont.) • The early recording machines, developed by Edison and Berliner in the late 19th century, were completely mechanical or 'acoustic'. Such systems (Edison's 'phonograph' and Berliner's 'gramophone'), as shown on the previous slide, typically consisted of a horn, a diaphragm, a stylus, and a cylinder covered with soft foil. • The recordist spoke or sang into the horn, causing the diaphragm and stylus to vibrate, which inscribed a modulated groove into the surface of the soft foil or gramophone disk. On reproduction, the modulated groove would cause the stylus and diaphragm to vibrate, resulting in a sound wave being emitted from the horn. • The sounds recorded and reproduced by such early systems had a limited frequency range and were heavily distorted. • Recordings made directly on the disk lasted only the duration of one side of the disk, at the time a maximum of around 4 minutes, with no possibility of editing; longer works had to be recorded in short sections, with gaps to change the disk. In addition, instruments had to be placed quite tightly around the pickup horn for them to be heard on the recording.
Electrical recording machines Source: Francis Rumsey and Tim McCormick (1994)
Electrical recording (cont.) • Early electrical recording machines, developed during the 1920s, were based on the principles of electromagnetic transduction. • In such systems, the outputs of microphones could be mixed together before being fed to the disk cutter. Variable resistors could be inserted into the system to control the level from each microphone, and valve amplifiers could be used to raise the electrical level enough to drive the stylus transducer. • The microphones could be placed remotely from the recording machine, allowing more flexibility in their positioning. • The sound quality of electrical recordings was markedly improved over that of mechanical recordings, with a wider frequency range and a greater dynamic range.
Electromagnetic transducers Source: Francis Rumsey and Tim McCormick (1994)
Electromagnetic transducers (cont.) • If a wire is made to move in a magnetic field, perpendicular to the lines of flux linking the poles of the magnet, then an electric current is induced in the wire (as shown in the diagram on the previous page). • The direction of motion governs the direction of current flow in the wire, so if the wire is made to move back and forth, an alternating current is induced in it. Conversely, if a current is made to flow through a wire which cuts the lines of a magnetic field, then the wire will move. • A simple moving-coil microphone involves a wire moving in a magnetic field, by means of a coil attached to a flexible diaphragm which vibrates in sympathy with the sound wave. The frequency of the electrical signal (i.e. the output of the microphone) is the same as that of the sound wave, and the amplitude of the electrical signal is proportional to the velocity of the coil.
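The proportionality between output voltage and coil velocity follows from the standard e = Blv relation for a wire moving perpendicular to a magnetic field (a sketch with illustrative values, not figures from the slides):

```python
def induced_emf(flux_density_t: float, wire_length_m: float,
                velocity_m_s: float) -> float:
    """EMF induced in a wire moving perpendicular to a magnetic field:
    e = B * l * v, so the output voltage tracks the coil velocity."""
    return flux_density_t * wire_length_m * velocity_m_s

# Doubling the coil velocity doubles the microphone's output voltage:
e1 = induced_emf(1.2, 0.05, 0.01)  # illustrative B, l, v values
e2 = induced_emf(1.2, 0.05, 0.02)  # twice e1
```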
Later developments in sound recording • 1940s: the introduction of the first AC-biased tape recorders. Tape was first made of paper coated with metal oxide, which tended to deteriorate rather quickly, and later of plastics, which proved longer lasting and easier to handle. • 1950s: the introduction of the microgroove LP (long-playing) record, with lower surface noise and better frequency response. • 1960s: the development of stereo recordings and the introduction of multitrack tape recorders. • Later developments include high-quality digital recording, with Compact Disc and digital tape systems.
Examples of modern recording chain - stereo recording Source: Francis Rumsey and Tim McCormick (1994)
Examples of modern recording chain - stereo recording (cont.) • Recording: Microphone sources are mixed 'live' to create a stereo session master, either analogue or digital, which is a collection of recordings. The balance between the sources must be correct at this stage. • Editing: The recorded source material is assembled in an artistically satisfactory manner, under the control of the producer, to create a final master. • Mastering: Additional and/or special information is added to the master for commercial release of the recordings on LPs, cassettes and CDs. For example, an LP master requires special equalisation to prepare it for cutting to disk. • Stereo recording is a cheaper and less time-consuming mode of production than multitrack recording, but requires skill to achieve a usable balance quickly.
Examples of modern recording chain - multitrack recording Source: Francis Rumsey and Tim McCormick (1994)
Examples of modern recording chain - multitrack recording (cont.) • Recording: Acoustic and electrical sources are fed into a mixer and recorded onto multitrack tape. The resulting tape contains a collection of individual sources on multiple tracks. Individual songs or titles are recorded at separate places on the tape. In track laying, a master is usually built up by laying down backing tracks (drums, keyboards, rhythm guitars, etc.) for a complete song. • Mixing: The multiple tracks are mixed down into the final format, usually stereo. • Mastering: Additional and/or special information is added to the master for commercial release of the recordings on LPs, cassettes and CDs. For example, an LP master requires special equalisation to prepare it for cutting to disk. • Multitrack recording is usually carried out in the recording studio.
Broadcast distribution Source: Francis Rumsey and Tim McCormick (1994)
Broadcast distribution (cont.) • A typical television sound signal from an outside broadcast location may have travelled a large number of miles (as shown by the figure on the previous page). • A radio microphone may transmit locally to the outside broadcast (OB) vehicle, which may then use a microwave radio link to send the signal back to the studio centre. • The sound signal may then travel through a length of internal cabling within the studio centre, finally to be connected to the network which distributes signals around the country. • The network takes the form of either 'land-lines' (equalised to compensate for losses over distance) or wireless radio-frequency (RF) links (possibly digitally encoded). • On the receiving side, the signal is distributed to the consumer from the transmitter through a further RF link.
References • Francis Rumsey and Tim McCormick, Sound and Recording: an Introduction, 1994. • Alan P. Kefauver and David Patschke, Fundamentals of Digital Audio, 2007.