
MPEG


Presentation Transcript


  1. MPEG Howell Istance School of Computing De Montfort University

  2. Moving Picture Experts Group • Established in 1988 with a remit to develop standards for the coded representation of audio, video and their combination • operates within the framework of the Joint ISO/IEC Technical Committee (JTC1 on Information Technology), organised into committees and sub-committees • originally 25 experts, now approximately 350 experts from 200 companies and academic institutions, meeting approximately 3 times per year (depending on the committee) • all standards work takes a long time and requires international agreement, and is of potentially great strategic importance to industry

  3. MPEG-1 standards • Video standard for low fidelity video, implemented in software codecs, suitable for transmission over computer networks • The audio standard has 3 layers; the encoding process increases in complexity, and the achievable data rates become lower, as the layer number increases • Layer 1 - 192 kbps • Layer 2 - 128 kbps • Layer 3 - 64 kbps (MPEG-1 Layer 3 = MP3) • (these data rates are doubled for a stereo signal)
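For context, a quick back-of-the-envelope comparison of the Layer 3 stereo rate quoted above against uncompressed CD-quality PCM (assuming 44.1 kHz, 16-bit stereo, which the slide does not state):

```python
# Rough compression-ratio arithmetic for the Layer 3 figure quoted above.
SAMPLE_RATE_HZ = 44_100      # CD-quality sampling frequency (assumed)
BITS_PER_SAMPLE = 16
CHANNELS = 2                 # stereo

pcm_kbps = SAMPLE_RATE_HZ * BITS_PER_SAMPLE * CHANNELS / 1000  # ~1411 kbps
mp3_kbps = 64 * CHANNELS     # Layer 3 at 64 kbps per channel, doubled for stereo

print(f"PCM {pcm_kbps:.0f} kbps vs MP3 {mp3_kbps} kbps "
      f"-> ratio ~{pcm_kbps / mp3_kbps:.0f}:1")   # roughly 11:1
```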

  4. MPEG-1 Layer 3 Audio encoding • Encoders analyse an audio signal and compare it to psycho-acoustic models representing the limitations of human auditory perception • Encode as much useful information as possible within the restrictions set by the bit rate and sampling frequency • Discard samples whose amplitude falls below the threshold of hearing, which varies with frequency • Auditory masking - a louder sound masks a softer sound played simultaneously or close together in time, so the softer sound's samples can be discarded

  5. Psychoacoustic model Throw away samples that will not be perceived, i.e. those under the curve
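The slides do not name a specific threshold curve; a common textbook choice is Terhardt's approximation of the absolute threshold of hearing. A minimal sketch of "throwing away what is under the curve" applied to spectrum bin levels (a real encoder works on subband/MDCT coefficients and layers masking effects on top of this):

```python
import numpy as np

def hearing_threshold_db(freq_hz):
    """Terhardt's approximation of the absolute threshold of hearing (dB SPL)."""
    f = freq_hz / 1000.0
    return 3.64 * f**-0.8 - 6.5 * np.exp(-0.6 * (f - 3.3)**2) + 1e-3 * f**4

def drop_inaudible(spectrum_db, freqs_hz):
    """Discard frequency bins whose level falls under the threshold curve."""
    audible = spectrum_db > hearing_threshold_db(freqs_hz)
    return np.where(audible, spectrum_db, -np.inf)  # -inf marks discarded bins

freqs = np.array([100.0, 1000.0, 4000.0, 16000.0])
levels = np.array([20.0, 5.0, -5.0, 10.0])          # dB SPL per bin
print(drop_inaudible(levels, freqs))                # only the 1 kHz bin survives
```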

  6. MPEG-1 Layer 3 Audio encoding • Temporal masking - a loud sound masks quieter sounds that occur shortly before or shortly after it in time, so those samples can also be discarded • Bit reservoir - data is organised into ‘frames’; space left over in one frame can be used to store data from adjacent frames that need additional space • Joint stereo - very high and very low frequencies cannot be located in space with the same precision as sounds towards the centre of the audible spectrum, so these can be encoded as mono • Huffman encoding removes redundancy in the encoding of repetitive bit patterns and can reduce file sizes by around 20% (see the sketch below)
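MP3's actual Huffman tables are fixed by the standard, but the principle is the classic one: frequently occurring values get short codes, rare ones get long codes. A generic sketch (not the standard's tables):

```python
import heapq
from collections import Counter

def huffman_code(symbols):
    """Build a prefix code table: frequent symbols get shorter codes."""
    heap = [[weight, [sym, ""]] for sym, weight in Counter(symbols).items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        lo, hi = heapq.heappop(heap), heapq.heappop(heap)
        for pair in lo[1:]:
            pair[1] = "0" + pair[1]      # left branch contributes a 0 bit
        for pair in hi[1:]:
            pair[1] = "1" + pair[1]      # right branch contributes a 1 bit
        heapq.heappush(heap, [lo[0] + hi[0]] + lo[1:] + hi[1:])
    return dict(heap[0][1:])

table = huffman_code("aaaaabbbc")
print(table)   # the frequent 'a' receives the shortest code
```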

  7. Masking effects • Throw away samples in the region masked by a louder tone

  8. Schematic of MPEG-1 Layer 3 encoding http://www.iis.fhg.de/amm/techinf/layer3/index.htm

  9. MPEG-2 standards • Video standard for high fidelity video • ‘Levels’ bound parameters such as maximum frame size and data rate • ‘Profiles’ define feature sets (including chrominance subsampling) and may be implemented at one or more levels • MP@ML (“main profile at main level”) uses CCIR 601 scanning, 4:2:0 chrominance subsampling (see the calculation below) and supports a data rate of up to 15 Mbps • MP@ML is used for digital television broadcasting and DVD • Audio standard is essentially the same as MPEG-1, with extensions to cope with surround sound
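To see how much raw data 4:2:0 chrominance subsampling saves before any transform coding, a quick calculation at the CCIR 601 frame size (assuming 8 bits per sample):

```python
# Raw bytes per frame at CCIR 601 resolution (720x576, 8 bits per sample).
W, H = 720, 576

luma = W * H                           # one Y sample per pixel
chroma_444 = 2 * W * H                 # full-resolution Cb and Cr
chroma_420 = 2 * (W // 2) * (H // 2)   # Cb and Cr halved in both directions

print(f"4:4:4 frame: {(luma + chroma_444) / 1e6:.2f} MB")  # ~1.24 MB
print(f"4:2:0 frame: {(luma + chroma_420) / 1e6:.2f} MB")  # ~0.62 MB, half the raw data
```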

  10. MPEG-4 • The MPEG-4 standard activity aimed to define an audiovisual coding standard addressing the needs of the communication, interactive (computing) and broadcasting (TV/film/entertainment) service models • In MPEG-1 and MPEG-2, ‘systems’ referred to the overall architecture, multiplexing and synchronisation • In MPEG-4, ‘systems’ also includes scene description, interactivity, content description and programmability • Initial call for proposals - July 1995; version 2 amendments - December 2000

  11. Images from Jean-Claude Dufourd, ENST, Paris

  12. Images from Jean-Claude Dufourd, ENST, Paris

  13. Images from Jean-Claude Dufourd, ENST, Paris

  14. MPEG-4 Systems - mission “Develop a coded, streamable representation for audio-visual objects and their associated time-variant data along with a description of how they are combined” • ‘coded representation’ as opposed to ‘textual representation’ - binary encoding for bandwidth efficiency • ‘streamable’ as opposed to ‘downloaded’ - presentations have a temporal extent rather than being based on files of a finite size • ‘audio-visual objects and their associated time-variant data’ as opposed to ‘individual audio or visual streams’ - MPEG-4 Systems deals with combining streams to create an interactive scene, not with the encoding of the audio or visual data itself

  15. MPEG-4 Principles • Audio-visual objects - representations of natural or synthetic objects which have an audio and/or visual manifestation (e.g. a video sequence, a 3D animated face) • Scene description - information describing where, when and for how long a-v objects will appear • Interactivity is expressed in 3 requirements: • client-side interaction with the scene description as well as with exposed properties of a-v objects • behaviour attached to a-v objects, triggered by events such as user actions or timeouts (see the sketch after this list) • client-server interaction - user data sent back to the server, with the server responding with modifications to the scene (for example)
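A purely illustrative sketch of the second requirement - behaviour attached to an object and triggered by events; the class and method names here are hypothetical, not MPEG-4 API:

```python
# Hypothetical names (AVObject, on, fire) - illustrating the idea of
# event-triggered behaviour attached to an audio-visual object.
class AVObject:
    def __init__(self, name):
        self.name = name
        self.handlers = {}               # event name -> list of callbacks

    def on(self, event, callback):
        self.handlers.setdefault(event, []).append(callback)

    def fire(self, event):
        for callback in self.handlers.get(event, []):
            callback(self)

face = AVObject("animated_face")
face.on("click", lambda obj: print(f"{obj.name}: start talking"))
face.fire("click")   # a user-generated event triggers the attached behaviour
```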

  16. MPEG-4 Systems Principles [Diagram: an interactive scene description assembled from a scene description stream, an object description stream, several visual object streams and an audio object stream]

  17. MPEG-4 Systems Principles [Same diagram, with the scene description, object description and audio/visual object streams labelled collectively as the elementary streams]

  18. Object Descriptor Framework • Glue between the scene description and the streaming resources (the elementary streams) • Object descriptor: a container structure (sketched below) - encapsulates all setup and association information for a set of elementary streams, plus a set of sub-descriptors describing the individual streams (e.g. configuration information for a stream's decoder) • Groups sets of streams that are seen as a single entity from the perspective of the scene description • The object description framework is separated from the scene description so that elementary streams can be changed and re-located without changing the scene description
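A hypothetical sketch of that container relationship; the field names are illustrative, not the normative MPEG-4 syntax:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ESDescriptor:
    es_id: int             # identifies one elementary stream
    stream_type: str       # e.g. "visual", "audio", "scene description"
    decoder_config: bytes  # setup information for the stream's decoder

@dataclass
class ObjectDescriptor:
    od_id: int             # the scene description refers only to this id
    es_descriptors: List[ESDescriptor] = field(default_factory=list)

# Streams can be swapped or re-located by editing the descriptor,
# leaving the scene description untouched.
od = ObjectDescriptor(od_id=42, es_descriptors=[
    ESDescriptor(es_id=1, stream_type="visual", decoder_config=b"\x20"),
    ESDescriptor(es_id=2, stream_type="audio", decoder_config=b"\x40"),
])
```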

  19. BIFS - BInary Format for Scenes • Specifies the spatial and temporal locations of objects in scenes, together with their attributes and behaviours • The elements of the scene and the relationships between them form a scene graph that must be encoded for transmission • Based heavily on VRML; supports almost all VRML nodes • Does not support the use of Java in Script nodes (only ECMAScript) • Expands on the functionality of VRML, allowing a much broader range of applications to be supported

  20. BIFS expansions to VRML • Compressed binary format: • BIFS describes an efficient binary representation of the scene graph information • Coding may be either lossless or lossy • Coding efficiency derives from a number of classical compression techniques, plus some novel ones • Knowledge of context is exploited heavily in BIFS • Streaming: • A scene may be transmitted as an initial scene followed by timestamped modifications to the scene • The BIFS Command protocol allows replacement of the entire scene, addition/deletion/replacement of nodes and behavioural elements in the scene graph, and modification of scene properties (see the sketch below)
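An illustrative sketch of the command idea - timestamped insert/delete/replace operations applied to a scene graph; this mimics the spirit of BIFS Command, not its binary syntax:

```python
# A toy scene graph and command applier (names are hypothetical).
scene = {"root": {}}

def apply_command(scene, timestamp, op, node_id, payload=None):
    nodes = scene["root"]
    if op in ("insert", "replace"):
        nodes[node_id] = payload
    elif op == "delete":
        nodes.pop(node_id, None)
    print(f"t={timestamp:.1f}s: {op} {node_id}")

# An initial scene followed by timestamped modifications:
apply_command(scene, 0.0, "insert", "logo", {"type": "ImageTexture"})
apply_command(scene, 2.5, "replace", "logo", {"type": "MovieTexture"})
apply_command(scene, 9.0, "delete", "logo")
```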

  21. BIFS expansions to VRML • 2D Primitives: • BIFS includes native support for 2D scenes • This facilitates content creators who wish to produce low-complexity scenes, including the traditional television and multimedia industries • Many applications cannot bear the cost of requiring decoders to perform full 3D rendering and navigation; this is particularly true where hardware decoders must be of low cost, as for instance in television set-top boxes • Rather than simply partitioning the multimedia world into 2D and 3D, MPEG-4 BIFS allows the combination of 2D and 3D elements in a single scene

  22. BIFS expansions to VRML • Animation: • A second streaming protocol, BIFS-Anim, provides a low-overhead mechanism for the continuous animation of changes to numerical values of components in the scene • These streamed animations provide an alternative to the interpolator nodes supported in both BIFS and VRML (see the sketch below) • Enhanced Audio: • BIFS provides the notion of an “audio scene graph” • Audio sources, including streaming ones, can be mixed • Audio content can even be processed and transformed with special procedural code to produce various sound effects
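An illustrative sketch of the BIFS-Anim idea: a stream of timestamped numeric updates written directly to node fields, instead of interpolator nodes evaluated inside the scene graph (all names here are hypothetical):

```python
# A toy animation stream: (timestamp, node, field, value) tuples.
anim_stream = [
    (0.00, "sphere1", "translation_x", 0.0),
    (0.04, "sphere1", "translation_x", 0.1),
    (0.08, "sphere1", "translation_x", 0.2),
]

scene_fields = {}
for timestamp, node, field_name, value in anim_stream:
    # Each update continuously overwrites the animated field's value.
    scene_fields[(node, field_name)] = value
    print(f"t={timestamp:.2f}s: {node}.{field_name} = {value}")
```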

  23. BIFS expansions to VRML • Facial Animation: • BIFS provides support at the scene level for the MPEG-4 Facial Animation decoder • A special set of BIFS nodes exposes the properties of the animated face at the scene level • The animated face can be integrated with all BIFS functionality, like any other audio or visual object
