Università degli studi Roma Tre Overview of the H.264/AVC video coding standard Maiorana Emanuele


Agenda • Introduction • Project overview & timeline • Standardization concepts • Codec technical design • New Features • Prediction • De-Blocking • Entropy Coding • Profiles & levels • Comparisons

Introduction H.264/AVC is the newest video coding standard developed by the ITU-T/ISO/IEC Joint Video Team (JVT), consisting of experts from: • ITU-T Video Coding Experts Group (VCEG) • ISO/IEC Moving Picture Experts Group (MPEG) Its design represents a delicate balance between: • Coding gain (efficiency improved by a factor of two over MPEG-2) • Implementation complexity • Costs based on the state of VLSI (ASICs and microprocessors) design technology

Terminology The following terms are used interchangeably: • H.26L • The Work of the JVT or “JVT CODEC” • JM2.x, JM3.x, JM4.x • The Thing Beyond H.26L • The “AVC” or Advanced Video CODEC Proper Terminology going forward: • MPEG-4 Part 10 (Official MPEG Term) • ISO/IEC 14496-10 AVC • H.264 (Official ITU Term)

History The digital representation of television signals has enabled many content delivery services: • Satellite • Cable TV • Terrestrial broadcasting • ADSL and fiber over IP To optimize these services, conflicting requirements must be met: • High Quality of Service (QoS) • Low bit rate • Low power consumption Source coding is responsible for reducing the bit rate. Example: transmitting the complete TV signal, as specified in Recommendation ITU-R BT.601, would require: [720 × 576 + 2 × (360 × 576)] × 25 × 8 ≈ 166 Mbit/s
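
The BT.601 figure above can be checked with a few lines of arithmetic. This is a minimal sketch; the function name `raw_bit_rate` and its parameters are illustrative choices, not part of any standard.

```python
def raw_bit_rate(luma_w, luma_h, chroma_w, chroma_h, fps, bits_per_sample):
    """Uncompressed bit rate in bit/s for one luma plane and two chroma planes."""
    samples_per_frame = luma_w * luma_h + 2 * (chroma_w * chroma_h)
    return samples_per_frame * fps * bits_per_sample


# Values from the slide: 720x576 luma, 360x576 per chroma plane, 25 frames/s, 8 bits.
rate = raw_bit_rate(720, 576, 360, 576, fps=25, bits_per_sample=8)
print(f"{rate / 1e6:.1f} Mbit/s")  # 165.9 Mbit/s
```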

Video Coding History Efforts in maximizing coding efficiency while dealing with: • diversification of network types • characteristic formatting and loss/error robustness requirements ITU standards for video telephony: • H.261, H.263, H.263+ ISO-MPEG standards: • MPEG-1: medium quality, physical support • MPEG-2: medium/high quality, physical and transmission support • MPEG-4: audio video objects

Evolution (1/2) MPEG-2 Introduction MPEG-4 in Comparison

Evolution (2/2) H.26L Provides Focus MPEG-4 “Adopts” H.26L

Codec Defects Blocking Low Original High

Codec Defects Packet Loss

Video Coding History • Early 1998: Started as ITU-T Q.6/SG16 (VCEG - Video Coding Experts Group) “H.26L” standardization activity • August 1999: first draft design • July 2001: MPEG open call for “AVC” technology: H.26L wins • December 2001: Formation of the Joint Video Team (JVT) between VCEG and MPEG to finalize H.26L as a joint project similar to MPEG-2/H.262 • July 2002: Final Committee Draft status in MPEG • March 2003: formal approval submission • October 2004: final ITU-T and ISO approval

Versions

JVT Project Technical Objectives Primary technical objectives: • Significant improvement in coding efficiency: average bit rate reduction of 50% compared to any other video standard • Network-friendly video representation for “conversational” (video telephony) and “non-conversational” (storage, broadcast or streaming) applications • Error resilient coding • Simple syntax specification, targeting simple and clean solutions The scope of the standardization is only the central decoder, by: • imposing restrictions on the bitstream and syntax • defining the decoding process such that every conforming decoder will produce similar output when given an encoded bitstream as input [Figure: Scope of the Standard. Processing chain: Source → Pre-Processing → Encoding → Decoding → Post-Processing & Error Recovery → Destination]

Applications The new standard is designed for technical solutions including at least the following application areas • Broadcast over cable, satellite, Cable Modem, DSL, terrestrial, etc. • Interactive or serial storage on optical and magnetic devices, DVD, etc. • Conversational services over ISDN, Ethernet, LAN, DSL, wireless and mobile networks, modems, etc. or mixtures of these. • Video-on-demand or multimedia streaming services over ISDN, Cable Modem, DSL, LAN, wireless networks, etc. • Multimedia Messaging Services (MMS) over ISDN, DSL, Ethernet, LAN, wireless and mobile networks, etc. How to handle this variety of applications and networks?

H.264 Design To address this need for flexibility and customizability, the H.264/AVC design covers a: • Video Coding Layer (VCL): representation of the video content (performing all the classic signal processing tasks) • Network Abstraction Layer (NAL): adaptation of VCL representations in a manner appropriate for conveyance by a variety of transport layers or storage media [Figure: layer structure. Video Coding Layer (coded macroblocks) → Data Partitioning (coded slices/partitions) → Network Abstraction Layer, together with control data, mapped to H.320, MP4FF, H.323/IP, MPEG-2, etc.]

Features enhancing coding efficiency (1) Enhancements in picture encoding are enabled through value prediction methods: • Variable block-size motion compensation with small block sizes • Quarter-sample-accurate motion compensation • Multiple reference picture motion compensation • Weighted prediction • Directional spatial prediction for intra coding • In-the-loop deblocking filtering

Features enhancing coding efficiency (2) Enhancements in picture encoding are enabled through high-performance tools: • Small block-size transform • Hierarchical block transform • Short word-length transform • Exact-match inverse transform • Arithmetic entropy coding • Context-adaptive entropy coding

Features enhancing Robustness Robustness to data errors/losses and flexibility for operation over a variety of network environments are enabled by new design aspects: • Parameter set structure • NAL unit syntax structure • Flexible slice size • Flexible macroblock ordering (FMO) • Redundant pictures • SP/SI synchronization/switching pictures

Network Abstraction Layer The Network Abstraction Layer (NAL) is designed to provide “network friendliness”, facilitating the mapping of H.264/AVC VCL data to transport layers such as: • RTP/IP for any kind of real-time wire-line and wireless Internet services (conversational and streaming) • File formats, e.g. ISO MP4 for storage and MMS • H.32X for wireline and wireless conversational services • MPEG-2 systems for broadcasting services, etc. Some key concepts of the Network Abstraction Layer are: • NAL Units • Use of NAL Units in: • Byte stream format systems • Packet-transport systems • Parameter Sets • Access Units

NAL Units The coded video data is organized into NAL units, each of which is effectively a packet that contains an integer number of bytes. NAL units are classified into: • VCL NAL units: contain the data associated with the video pictures • non-VCL NAL units: contain any associated additional information Header byte: the first byte of each NAL unit; it indicates the type of data in the NAL unit, and the remaining bytes contain payload data of the type indicated by the header. Emulation prevention bytes: bytes inserted in the payload data to prevent the accidental generation of a particular pattern of data called a start code prefix. NAL unit stream: the series of NAL units generated by an encoder.
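
The header byte and the emulation prevention mechanism can be sketched as follows. This is a minimal illustration, not a conformant parser: the function names are hypothetical, the header layout assumed is the one-byte H.264 header (1-bit forbidden_zero_bit, 2-bit nal_ref_idc, 5-bit nal_unit_type), and the escape rule assumed is "insert 0x03 after two zero bytes whenever the next byte is 0x03 or smaller".

```python
def parse_nal_header(first_byte):
    """Split the one-byte NAL unit header into its three fields."""
    return {
        "forbidden_zero_bit": (first_byte >> 7) & 0x01,
        "nal_ref_idc": (first_byte >> 5) & 0x03,
        "nal_unit_type": first_byte & 0x1F,
    }


def add_emulation_prevention(payload):
    """Insert 0x03 after two zero bytes that are followed by a byte <= 0x03."""
    out = bytearray()
    zeros = 0  # number of consecutive zero bytes just written
    for byte in payload:
        if zeros >= 2 and byte <= 0x03:
            out.append(0x03)
            zeros = 0
        out.append(byte)
        zeros = zeros + 1 if byte == 0x00 else 0
    return bytes(out)


def remove_emulation_prevention(data):
    """Drop every 0x03 that directly follows two zero bytes."""
    out = bytearray()
    zeros = 0
    for byte in data:
        if zeros >= 2 and byte == 0x03:
            zeros = 0
            continue
        out.append(byte)
        zeros = zeros + 1 if byte == 0x00 else 0
    return bytes(out)


print(parse_nal_header(0x67))
# {'forbidden_zero_bit': 0, 'nal_ref_idc': 3, 'nal_unit_type': 7}
print(add_emulation_prevention(bytes([0x00, 0x00, 0x01])).hex())  # 00000301
```

After escaping, the payload can no longer contain the three-byte pattern 0x000001, so a receiver scanning for start code prefixes will not mistake payload data for a NAL unit boundary.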

Use of NAL Units Bitstream-oriented transport systems (H.320, MPEG-2 systems): • Delivery of the entire or partial NAL unit stream as an ordered stream of bytes or bits → the locations of NAL unit boundaries need to be identifiable • In the byte stream format, each NAL unit is prefixed by a specific pattern of three bytes called a start code prefix Packet-oriented transport systems (IP, RTP systems): • The coded data is carried in packets that are framed by the system transport protocol → the boundaries of NAL units within the packets can be established without use of start code prefix patterns • The NAL units can be carried in data packets without start code prefixes
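
Locating NAL unit boundaries in the byte stream format can be sketched as a search for the three-byte start code prefix, assumed here to be 0x000001. The function name `split_byte_stream` is hypothetical, and the sketch assumes the payloads have already been escaped with emulation prevention bytes so the prefix never occurs inside a NAL unit.

```python
def split_byte_stream(stream):
    """Return the NAL units found between 0x000001 start code prefixes."""
    prefix = b"\x00\x00\x01"
    units = []
    pos = stream.find(prefix)
    while pos != -1:
        start = pos + len(prefix)
        nxt = stream.find(prefix, start)
        end = len(stream) if nxt == -1 else nxt
        # Zero bytes just before the next prefix are padding (for example the
        # extra zero of a four-byte 0x00000001 pattern), not NAL unit data.
        units.append(stream[start:end].rstrip(b"\x00"))
        pos = nxt
    return units


stream = (b"\x00\x00\x00\x01\x67\x42"
          b"\x00\x00\x01\x68\xce"
          b"\x00\x00\x00\x01\x65\x88")
print([unit.hex() for unit in split_byte_stream(stream)])
# ['6742', '68ce', '6588']
```

In a packet-oriented system this step is unnecessary, because the transport protocol already delivers each NAL unit with known boundaries.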

Parameter Sets A parameter set contains information that is expected to change rarely. Types of parameter sets: • sequence parameter sets: apply to a series of consecutive coded video pictures (coded video sequence) • picture parameter sets: apply to one or more individual pictures Parameter sets can be sent: • Once (ahead of the VCL NAL units) • Many times (to provide robustness) • In-band (same channel as the VCL NAL units) • Out-of-band (different channel) [Figure: out-of-band transmission]

Access Units A set of NAL units in a specified form is referred to as an access unit. The decoding of each access unit results in one decoded picture. It can be composed of: • access unit delimiter: to aid in locating the start of the access unit • supplemental enhancement information (SEI): containing data such as picture timing information • primary coded picture: set of VCL NAL units that represent the samples of the video picture • redundant coded pictures: for use by a decoder in recovering from loss or corruption • end of sequence: if the coded picture is the last picture of a coded video sequence • end of stream: if the coded picture is the last coded picture in the entire NAL unit stream

Video Coding Layer The VCL design follows the so-called block-based hybrid video coding approach. There is no single coding element in the VCL that provides the majority of the significant improvement in compression efficiency in relation to prior video coding standards. [Figure: encoder block diagram. The input video signal is split into macroblocks of 16x16 pixels. Blocks shown: Coder Control, Transform/Scaling/Quantization, embedded Decoder (Scaling & Inverse Transform, De-blocking Filter, Intra-frame Prediction, Motion Compensation, Intra/Inter switch), Motion Estimation, Entropy Coding. Signals shown: control data, quantized transform coefficients, motion data, output video signal]

Video Coding Layer • The picture is split into blocks • Intra coding of the first picture or of a random access point • Inter coding for all remaining pictures or between random access points • Transmission of the motion data as side information • Transform of the residual of the prediction (Intra or Inter) • Quantization of the transform coefficients • Entropy coding and transmission of the quantized transform coefficients, together with the side information

Pictures, frames, and fields A coded picture can represent either an entire frame or a single field. A frame of video can be considered to contain two interleaved fields: • interlaced frame: the two fields of a frame were captured at different time instants • progressive frame: both fields were captured at the same time instant The coding representation in H.264/AVC is primarily agnostic with respect to this video characteristic.

Adaptive frame/field coding operation In interlaced frames with regions of moving objects, two adjacent rows tend to show a reduced degree of statistical dependency. The H.264/AVC design allows either of the following decisions for coding a frame: • Frame mode: the two fields are combined and coded together • Field mode: the two fields are not combined and are coded separately The choice can be made adaptively for each frame and is referred to as Picture Adaptive Frame/Field (PAFF) coding. In field mode: • Motion compensation utilizes reference fields • The strongest de-blocking filter strength is not used for horizontal edges of macroblocks Moving regions favor field mode; non-moving regions favor frame mode.

Sampling YCbCr color space: Y is called luma and represents brightness; Cb and Cr are called chroma and represent the deviation from gray toward blue and red. H.264/AVC uses a sampling structure called 4:2:0 sampling with 8 bits of precision per sample. Each chroma component has one fourth of the number of samples of the luma component (half the number of samples in both the horizontal and vertical dimensions).

Macroblocks and Slices Each picture is partitioned into fixed-size macroblocks, each covering 16x16 samples of the luma component and 8x8 samples of each of the two chroma components. A slice is a sequence of macroblocks that are processed in raster-scan order when Flexible Macroblock Ordering (FMO) is not used. A picture is a collection of one or more slices in H.264/AVC. Independence: each slice can be correctly decoded without use of data from other slices.

Flexible Macroblock Ordering FMO uses the concept of slice groups. A slice group is a set of macroblocks defined by a macroblock to slice group map, which is specified in the picture parameter set. A slice is a sequence of macroblocks within the same slice group. FMO is useful for error concealment in video conferencing applications.
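
A macroblock to slice group map can be pictured with a simple checkerboard of two slice groups. This is only an illustrative sketch: the function names are hypothetical, and the `(x + y) % 2` rule is chosen for clarity rather than taken from the map types defined in the standard.

```python
def checkerboard_slice_group_map(width_in_mbs, height_in_mbs):
    """Assign macroblocks alternately to slice groups 0 and 1."""
    return [[(x + y) % 2 for x in range(width_in_mbs)]
            for y in range(height_in_mbs)]


def macroblocks_in_group(mb_map, group):
    """List the raster-order macroblock addresses that belong to one slice group."""
    width = len(mb_map[0])
    return [y * width + x
            for y, row in enumerate(mb_map)
            for x, g in enumerate(row) if g == group]


mb_map = checkerboard_slice_group_map(4, 2)
print(mb_map)                           # [[0, 1, 0, 1], [1, 0, 1, 0]]
print(macroblocks_in_group(mb_map, 0))  # [0, 2, 5, 7]
```

With such a map, if the slices of one group are lost, every missing macroblock still has horizontal and vertical neighbours from the other group, which is what makes concealment easier.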

Slice Types • I slice: a slice in which all macroblocks are coded using intra prediction → it is coded exploiting only the spatial correlation • P slice: in addition to the coding types of the I slice, some macroblocks of the P slice can also be coded using inter prediction from previously decoded reference pictures, with at most one motion-compensated prediction signal per prediction block • B slice: in addition to the coding types available in a P slice, some macroblocks of the B slice can also be coded using inter prediction with two motion-compensated prediction signals per prediction block The following two coding types for slices are new: • SP slice: a slice that is coded such that efficient switching between different pre-coded pictures becomes possible • SI slice: a slice that allows an exact match of a macroblock in an SP slice for random access and error recovery purposes

Motivation for SP and SI slices The best-effort nature of today’s networks causes variations in the effective bandwidth available to a user. For video streaming, the server should adjust the source encoding parameters on the fly. Alternatively, each sequence can be represented using multiple, independent streams, with switching between them. • Prior video encoding standards: switching is possible only at I-frames • H.264: identical SP-frames can be obtained even when they are predicted using different reference frames

Intra-frame Prediction In all slice-coding types, the following types of intra coding are supported: • Intra_4x4 with chroma prediction: for areas of a picture with significant detail • Intra_16x16 with chroma prediction: for very smooth areas of a picture • I_PCM: for values of anomalous picture content (accurate representation) Intra prediction in H.264/AVC is always conducted in the spatial domain. IDR: a picture composed of I slices only • can be decoded without any reference • no subsequent picture in the stream will require reference to pictures prior to the IDR picture Chroma samples: predicted with a technique similar to the one used for the luma component in Intra_16x16 macroblocks

Intra_4x4 mode [Figure: labelling of prediction samples (4x4). The block samples a-p are predicted from the neighbouring samples A-H (above and above-right), I-L (left) and M (above-left).] Modes shown: 0 (vertical), 1 (horizontal), 2 (DC: mean of the neighbours, (A+B+C+D+I+J+K+L)/8), 3 (diagonal down-left)

Intra_4x4 mode [Figure: labelling of prediction samples (4x4), as on the previous slide.] Modes shown: 4 (diagonal down-right), 5 (vertical-right), 6 (horizontal-down), 7 (vertical-left)

Intra_4x4 mode [Figure: labelling of prediction samples (4x4), as on the previous slides.] Mode shown: 8 (horizontal-up) When samples E-H are not available, they are replaced by D.
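
The three simplest Intra_4x4 modes can be sketched directly from the sample labelling. This is a minimal sketch under stated assumptions: `intra_4x4_predict` is a hypothetical helper, `above` holds samples A-D, `left` holds samples I-L, the DC value uses integer rounding `(sum + 4) >> 3`, and the directional modes 3 to 8 are not covered.

```python
def intra_4x4_predict(mode, above, left):
    """Predict a 4x4 block from the 4 samples above (A-D) and the 4 to the left (I-L)."""
    if mode == 0:    # vertical: every row repeats the samples above
        return [list(above) for _ in range(4)]
    if mode == 1:    # horizontal: every row is filled with the sample to its left
        return [[left[y]] * 4 for y in range(4)]
    if mode == 2:    # DC: rounded mean of the eight neighbouring samples
        dc = (sum(above) + sum(left) + 4) >> 3
        return [[dc] * 4 for _ in range(4)]
    raise ValueError("only modes 0, 1 and 2 are sketched here")


above = [10, 20, 30, 40]   # A, B, C, D
left = [50, 60, 70, 80]    # I, J, K, L
print(intra_4x4_predict(2, above, left)[0])  # [45, 45, 45, 45]
```

An encoder would typically evaluate each mode, compare the prediction with the actual block, and signal the mode that leaves the smallest residual.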

Intra_16x16 and I_PCM mode Intra_16x16: • Mode 0: vertical • Mode 1: horizontal • Mode 2: DC • Mode 3: plane (a linear “plane” function is fitted to the upper and left-hand samples → works well in areas of smoothly varying luminance) I_PCM sends the values of the encoded samples directly, to represent them precisely.

Inter-frame Prediction in P Slices Partitions with luma block sizes of 16x16, 16x8, 8x16, and 8x8 samples are supported by the syntax. If partitions with 8x8 samples are chosen, one additional syntax element is transmitted for each 8x8 partition, specifying whether it is further split into 8x4, 4x8 or 4x4 blocks. The prediction signal is specified by: • a translational motion vector • a picture reference index A maximum of sixteen motion vectors may be transmitted for a single P macroblock (four 8x8 partitions, each split into four 4x4 blocks).

Algorithm The encoder selects the “best” partition size for each part of the frame, to minimize the coded residual and motion vectors. The macroblock partitions chosen for each area are shown superimposed on the residual frame. • little change between the frames (residual appears grey) → a 16x16 partition is chosen • detailed motion (residual appears black or white) → smaller partitions are more efficient [Figure: Residual (no motion compensation)]
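
How an encoder might search for a motion vector can be sketched with an exhaustive block-matching search. The standard does not prescribe the search method, so this is only one possible approach: the sum of absolute differences (SAD) cost, the function names and the small synthetic frames are all assumptions made for illustration.

```python
def sad(cur, ref, bx, by, dx, dy, size):
    """Sum of absolute differences between a block and a displaced reference block."""
    total = 0
    for y in range(size):
        for x in range(size):
            total += abs(cur[by + y][bx + x] - ref[by + y + dy][bx + x + dx])
    return total


def full_search(cur, ref, bx, by, size, search_range):
    """Return (cost, dx, dy) of the best integer displacement within the range."""
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            # Skip candidates that fall outside the reference frame.
            if not (0 <= bx + dx and bx + dx + size <= width
                    and 0 <= by + dy and by + dy + size <= height):
                continue
            cost = sad(cur, ref, bx, by, dx, dy, size)
            if best is None or cost < best[0]:
                best = (cost, dx, dy)
    return best


# Synthetic 8x8 reference frame, and a current frame whose content is the
# reference shifted by 2 samples horizontally and 1 sample vertically.
ref = [[x * x + 10 * y * y for x in range(8)] for y in range(8)]
cur = [[ref[y + 1][x + 2] if y + 1 < 8 and x + 2 < 8 else 0 for x in range(8)]
       for y in range(8)]
print(full_search(cur, ref, bx=2, by=2, size=4, search_range=2))  # (0, 2, 1)
```

A partition-size decision could repeat this search for each candidate partition and compare the total cost, including an estimate of the bits needed for the extra motion vectors.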

Effects (1/2) Frame Fn Frame Fn-1 Residual (no motion compensation) Residual (16x16 block size)

Effects (2/2) Residual (8x8 block size) Residual (4x4 block size) Residual (4x4 block size; half pixel) Residual (4x4 block size; quarter pixel)

Example (1/2) Frame Fn Reconstructed reference Frame F’n-1 Residual Fn – F’n-1 (no motion compensation) 16x16 Motion Vector Field

Example (2/2) Motion compensation reference frame Motion compensation residual frame

Motion Estimation Accuracy The accuracy of motion compensation is in units of one quarter of the distance between luma samples. • Integer-sample position: the prediction signal consists of the corresponding samples of the reference picture • Non-integer-sample position: the corresponding sample is obtained using interpolation The prediction values at half-sample positions are obtained by applying a one-dimensional 6-tap FIR Wiener filter horizontally and vertically.

Motion Estimation Accuracy
• Half-sample positions (aa, bb, b, s, gg, hh and cc, dd, h, m, ee, ff) are derived by first calculating intermediate values. Ex.:
b1 = ( E - 5 F + 20 G + 20 H - 5 I + J )
h1 = ( A - 5 C + 20 G + 20 M - 5 R + T )
b = ( b1 + 16 ) >> 5
h = ( h1 + 16 ) >> 5
• Position j:
j1 = cc1 - 5 dd1 + 20 h1 + 20 m1 - 5 ee1 + ff1
j = ( j1 + 512 ) >> 10
• Quarter-sample positions (a, c, d, n, f, i, k, q) are derived by averaging, with upward rounding, the two nearest samples at integer and half-sample positions. Ex.:
a = ( G + b + 1 ) >> 1
• Quarter-sample positions (e, g, p, r) are derived by averaging, with upward rounding, the two nearest samples at half-sample positions in the diagonal direction. Ex.:
e = ( b + h + 1 ) >> 1
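
The luma interpolation formulas above translate almost directly into code. This is a minimal sketch for 8-bit samples: the function names are hypothetical, and the clipping of results to the range 0..255 is an added assumption that the slide does not state.

```python
def clip255(value):
    """Clamp a value to the 8-bit sample range."""
    return max(0, min(255, value))


def six_tap(p0, p1, p2, p3, p4, p5):
    """Intermediate value of the (1, -5, 20, 20, -5, 1) filter, e.g. b1 or h1."""
    return p0 - 5 * p1 + 20 * p2 + 20 * p3 - 5 * p4 + p5


def half_sample(p0, p1, p2, p3, p4, p5):
    """Half-sample value such as b = (b1 + 16) >> 5, from six integer samples."""
    return clip255((six_tap(p0, p1, p2, p3, p4, p5) + 16) >> 5)


def centre_sample(i0, i1, i2, i3, i4, i5):
    """Position j = (j1 + 512) >> 10, from six intermediate (unrounded) values."""
    return clip255((six_tap(i0, i1, i2, i3, i4, i5) + 512) >> 10)


def quarter_sample(s0, s1):
    """Quarter-sample value such as a = (G + b + 1) >> 1."""
    return (s0 + s1 + 1) >> 1


# A vertical edge: E, F, G are dark, H, I, J are bright.
E, F, G, H, I, J = 0, 0, 0, 255, 255, 255
b = half_sample(E, F, G, H, I, J)
a = quarter_sample(G, b)
print(b, a)  # 128 64
```

The filter weights sum to 32, which is why a single pass is normalised with `>> 5` and the two-pass position j with `>> 10`.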

Motion Estimation Accuracy The prediction values for the chroma component are always obtained by bilinear interpolation. For chroma the resolution is halved (4:2:0), so quarter-pixel accuracy for luma corresponds to one-eighth-pixel accuracy for chroma. With A, B, C, D the four surrounding integer-position chroma samples and dx, dy the fractional offsets in eighths of a sample:
a = round{ [ (8-dx)·(8-dy)·A + dx·(8-dy)·B + (8-dx)·dy·C + dx·dy·D ] / 64 }
Ex. (dx = 2, dy = 3): a = round[ (30A + 10B + 18C + 6D) / 64 ]
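
The chroma interpolation can be sketched in integer arithmetic. The function names are hypothetical, and the rounding is implemented as "add 32, then shift right by 6", which is one common way to realise round(x / 64) for non-negative values.

```python
def chroma_weights(dx, dy):
    """Bilinear weights for samples A, B, C, D at fractional offset (dx, dy) in eighths."""
    return ((8 - dx) * (8 - dy), dx * (8 - dy), (8 - dx) * dy, dx * dy)


def chroma_sample(A, B, C, D, dx, dy):
    """Interpolated chroma value between four integer-position samples."""
    wa, wb, wc, wd = chroma_weights(dx, dy)
    return (wa * A + wb * B + wc * C + wd * D + 32) >> 6


print(chroma_weights(2, 3))                   # (30, 10, 18, 6)
print(chroma_sample(100, 100, 100, 100, 2, 3))  # 100
```

The four weights always sum to 64, so a flat area is reproduced exactly and the result stays within the range of the input samples.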

Multi-picture Prediction Multi-picture motion compensation using previously encoded pictures as references allows up to 32 reference pictures to be used in some cases. It gives a very significant bit rate reduction for scenes with: • rapid repetitive flashing • back-and-forth scene cuts • uncovered background areas

Inter-frame Prediction in B Slices The concept of B slices is generalized in H.264/AVC: • Other pictures can reference pictures containing B slices for motion-compensated prediction • Some macroblocks or blocks may use a weighted average of two distinct motion-compensated prediction values for building the prediction signal In B slices, four different types of inter-picture prediction are supported: • list 0 (backward) • list 1 (forward) • bi-predictive: weighted average of motion-compensated list 0 and list 1 prediction signals • direct prediction: inferred from previously transmitted syntax elements It is also possible to have both motion predictions from the past, or both motion predictions from the future.
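
Bi-predictive averaging can be sketched for the simplest case of equal weights. This is an assumption-laden illustration: `bi_predict` is a hypothetical helper, both inputs are taken to be already motion-compensated blocks of the same size, and explicit weighting factors are not shown.

```python
def bi_predict(pred_l0, pred_l1):
    """Average two motion-compensated blocks sample by sample, rounding upward."""
    return [[(a + b + 1) >> 1 for a, b in zip(row0, row1)]
            for row0, row1 in zip(pred_l0, pred_l1)]


list0_block = [[10, 20], [30, 40]]
list1_block = [[20, 21], [30, 45]]
print(bi_predict(list0_block, list1_block))  # [[15, 21], [30, 43]]
```

Averaging two predictions tends to reduce noise in the prediction signal, which is one reason bi-prediction lowers the residual energy.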

Transform: types Each residual macroblock is transformed, quantized and coded. H.264 uses a smaller transform size than prior standards, and three transforms depending on the type of residual data to be coded: • a 4x4 transform for the luma DC coefficients in Intra_16x16 macroblocks • a 2x2 transform for the chroma DC coefficients • a 4x4 transform for all other blocks Adaptive block size transform mode: further transforms may optionally be chosen depending on the motion compensation block size (4x8, 8x4, 8x8, 16x8, etc.)
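
The 4x4 residual transform can be sketched with the integer matrix commonly given for the forward core transform, Y = Cf · X · Cfᵀ. This is an illustration under stated assumptions: the standard itself specifies the inverse transform, the scaling that is normally folded into quantization is omitted, and the helper names are hypothetical.

```python
# Integer matrix commonly used for the forward 4x4 core transform.
CF = [[1, 1, 1, 1],
      [2, 1, -1, -2],
      [1, -1, -1, 1],
      [1, -2, 2, -1]]


def matmul(a, b):
    """Multiply two 4x4 matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def transpose(m):
    return [list(row) for row in zip(*m)]


def forward_core_transform(block):
    """Unscaled 4x4 integer transform of a residual block: CF * block * CF^T."""
    return matmul(matmul(CF, block), transpose(CF))


flat = [[1] * 4 for _ in range(4)]
print(forward_core_transform(flat)[0])  # [16, 0, 0, 0]
```

Because the matrix contains only the values ±1 and ±2, the transform needs just additions and shifts, and the matching inverse can be computed exactly in 16-bit integer arithmetic.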

Transform: Order For a 16x16 Intra mode coded macroblock: • Block “-1”: DC coefficients of each 4x4 luma block • Blocks “0-15”: luma residual blocks • Blocks “16-17”: DC coefficients from the Cb and Cr components • Blocks “18-25”: chroma residual blocks Coding of smooth areas.
