Decoding Order Recovery: Challenges and Solutions

Decoding Order Recovery in draft-ietf-avt-rtp-svc-08 Ye-Kui Wang, Stephan Wenger, Miska Hannuksela {ye-kui.wang, stephan.wenger, miska.hannuksela}@nokia.com March 2008

Outline
• Overview of RFC 3984 packetization modes
• The problem to solve
• Existing solutions in the draft
• Issues of the existing solutions
• A brief history
• Other possible solutions
• Possible ways to go

Overview of RFC 3984 packetization modes (1)
• RFC 3984 supports 3 packetization modes: single NAL unit mode, non-interleaved mode, and interleaved mode
• The NAL (network abstraction layer) unit is the basic coded data unit in H.264/AVC and SVC
• In single NAL unit mode and non-interleaved mode,
  • Decoding order of NAL units is equal to their transmission order
  • Decoding order is the order in which NAL units appear in the bitstream
  • One packet contains coded data only from a single access unit (identified by presentation/RTP/NTP timestamp)
  • As indicated in the next slide, in non-interleaved mode one packet may have more than one NAL unit in STAP-A
• In interleaved mode,
  • Decoding order of NAL units may be different from their transmission order
  • Therefore, the decoding order number (DON) can be derived for each NAL unit
  • One packet may contain coded data from multiple access units

Overview of RFC 3984 packetization modes (2)
• Single NAL unit mode allows
  • Single NAL unit packets (i.e. packets each containing only one NAL unit)
• Non-interleaved mode allows
  • Single NAL unit packets
  • Single-time aggregation packets of type A (STAP-A)
    • Each STAP-A contains one or more NAL units from the same access unit
  • Packets each containing a fragmentation unit type A (FU-A)
    • An FU-A contains a part of a NAL unit
• Interleaved mode allows
  • Single-time aggregation packets of type B (STAP-B)
    • Compared to STAP-A, STAP-B additionally contains DON information
  • A series of packets each containing a fragmentation unit, with the first packet containing an FU-B
    • Compared to FU-A, FU-B (fragmentation unit type B) additionally contains DON information
  • Multi-time aggregation packets (MTAPs)
    • Each MTAP contains one or more NAL units from one or more access units, as well as DON information and timestamp information for each NAL unit

The problem to solve (1)
• In session multiplexing, coded data units (i.e. NAL units in SVC) of more than one layer are conveyed in more than one RTP session.
• A receiver receiving more than one RTP session MUST feed the NAL units conveyed in all the sessions to the decoder in decoding order.
• The decoding order recovery process reorders the received NAL units from reception order (after de-jittering) to decoding order.

The problem to solve (2)
An example is given below. The problem to solve is to ensure that the received NAL units shown below are sent to the decoder in the order 0 1 2 3 4 5 6 7 … (denoted by the CS-DON values). In other words, the problem to solve is to figure out the CS-DON values of all the NAL units.

AU            AU_0      AU_0      AU_1      AU_2
Session 1     NALu_1_0  NALu_1_1  NALu_1_2  NALu_1_3  …
  CS-DON      2         3         5         7         …
Session 0     NALu_0_0  NALu_0_1  NALu_0_2  NALu_0_3  …
  CS-DON      0         1         4         6         …
IS-DON        0         1         2         3         …
PTS           0         0         1         2         …

CS-DON: cross-session decoding order number (i.e. cross-layer DON or CL-DON in the draft)
IS-DON: in-session decoding order number (the same for both sessions above)
PTS: presentation timestamp (equal to NTP timestamp, the same for both sessions above)
AU: access unit (identified by PTS or NTP timestamp)
NALu: NAL unit
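The reordering in this example can be sketched in Python as a merge of two per-session lists by CS-DON. This is only an illustration: it assumes the CS-DON value of every NAL unit is already known, which is exactly what the receiver has to figure out. The tuple layout and the helper name `decoding_order` are hypothetical.

```python
import heapq

# (cs_don, name) pairs per session, taken from the example above.
# Within one session the units are listed in increasing CS-DON order.
session0 = [(0, "NALu_0_0"), (1, "NALu_0_1"), (4, "NALu_0_2"), (6, "NALu_0_3")]
session1 = [(2, "NALu_1_0"), (3, "NALu_1_1"), (5, "NALu_1_2"), (7, "NALu_1_3")]


def decoding_order(*sessions):
    """Merge per-session lists (each sorted by CS-DON) into one decoding order."""
    return [name for _, name in heapq.merge(*sessions)]


order = decoding_order(session0, session1)
```

Only the relative order of the CS-DON values matters to the merge, so it does not depend on the values being consecutive.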

The problem to solve (3)
• The following cases increase the difficulty of the problem (examples in the following slides).
  • Decoding order being different from presentation order
  • Transmission of layer-specific non-VCL NAL units in either the RTP session or enhancement sessions
    • VCL (video coding layer) NAL units are those NAL units containing coded data of slices
    • One special case is the prefix NAL unit, which may or may not be a VCL NAL unit, depending on whether the succeeding NAL unit is a coded slice or filler data NAL unit, respectively
    • Another special case is NAL units containing coded slices of auxiliary coded pictures, which are specified as non-VCL NAL units
    • Other NAL units are non-VCL NAL units, including parameter set NAL units as well as SEI (supplemental enhancement information) NAL units
  • Packet losses
  • Use of session multiplexing with temporal scalability

The problem to solve (4)
PTS or NTP timestamp order may be different from decoding order, as for AU_1 and AU_2 below. Hence RTP timestamps (even when initially set equal for the different sessions) or NTP timestamps by themselves do not indicate decoding order.

AU            AU_0      AU_0      AU_1      AU_2
Session 1     NALu_1_0  NALu_1_1  NALu_1_2  NALu_1_3  …
  CS-DON      2         3         5         7         …
Session 0     NALu_0_0  NALu_0_1  NALu_0_2  NALu_0_3  …
  CS-DON      0         1         4         6         …
IS-DON        0         1         2         3         …
PTS           0         0         2         1         …

CS-DON: cross-session decoding order number (i.e. cross-layer DON or CL-DON in the draft)
IS-DON: in-session decoding order number (the same for both sessions above)
PTS: presentation timestamp (equal to NTP timestamp, the same for both sessions above)
AU: access unit (identified by PTS or NTP timestamp)
NALu: NAL unit
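A small Python check makes the point concrete: sorting the access units of this example by PTS yields presentation order, not decoding order. The list layout and variable names are illustrative assumptions.

```python
# Access units of the example, listed in decoding order, each with its PTS.
decoding = [("AU_0", 0), ("AU_1", 2), ("AU_2", 1)]

decoding_order = [name for name, _ in decoding]
# Sorting by timestamp gives the order in which the pictures are presented.
presentation_order = [name for name, _ in sorted(decoding, key=lambda au: au[1])]

timestamps_give_decoding_order = decoding_order == presentation_order
```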

The problem to solve (5)
The CS-DON values of all the NAL units may also be as follows (with the order of NALu_1_0 and NALu_0_1 swapped compared to the previous slide) when NALu_1_0 is a parameter set NAL unit pertaining only to session 1, and must be as follows when NALu_1_0 is an SEI (supplemental enhancement information) NAL unit pertaining only to session 1. The order in which the received NAL units (denoted by the CS-DON values) shall be sent to the decoder is still: 0 1 2 3 4 5 6 7 …

AU            AU_0      AU_0      AU_1      AU_2
Session 1     NALu_1_0  NALu_1_1  NALu_1_2  NALu_1_3  …
  CS-DON      1         3         5         7         …
Session 0     NALu_0_0  NALu_0_1  NALu_0_2  NALu_0_3  …
  CS-DON      0         2         4         6         …
IS-DON        0         1         2         3         …
PTS           0         0         2         1         …

CS-DON: cross-session decoding order number (i.e. cross-layer DON or CL-DON in the draft)
IS-DON: in-session decoding order number (the same for both sessions above)
PTS: presentation timestamp (equal to NTP timestamp, the same for both sessions above)
AU: access unit (identified by PTS or NTP timestamp)
NALu: NAL unit

The problem to solve (6)
The NAL units received for AU_1 and AU_2 may be as follows, with NALu_1_2 and NALu_0_3 (present in the previous slides) missing due to packet losses. As gaps are allowed between consecutive CS-DON or IS-DON values, the order in which the received NAL units (denoted by the CS-DON values) shall be sent to the decoder is: 0 1 2 3 4 7 …

AU            AU_0      AU_0      AU_1      AU_2
Session 1     NALu_1_0  NALu_1_1  lost      NALu_1_3  …
  CS-DON      2         3                   7         …
Session 0     NALu_0_0  NALu_0_1  NALu_0_2  lost      …
  CS-DON      0         1         4                   …
IS-DON        0         1         2         3         …
PTS           0         0         2         1         …

CS-DON: cross-session decoding order number (i.e. cross-layer DON or CL-DON in the draft)
IS-DON: in-session decoding order number (the same for both sessions above)
PTS: presentation timestamp (equal to NTP timestamp, the same for both sessions above)
AU: access unit (identified by PTS or NTP timestamp)
NALu: NAL unit

The problem to solve (7)
When the two sessions convey two temporal scalable layers, without packet losses, the situation can be as follows. The order in which the received NAL units (denoted by the CS-DON values) shall be sent to the decoder is: 0 1 2 3 …

AU            AU_0      AU_1      AU_2      AU_3
Session 1               NALu_1_0            NALu_1_2  …
  CS-DON                1                   3         …
  IS-DON                0                   1         …
Session 0     NALu_0_0            NALu_0_1            …
  CS-DON      0                   2                   …
  IS-DON      0                   1                   …
PTS           2         1         4         3         …

CS-DON: cross-session decoding order number (i.e. cross-layer DON or CL-DON in the draft)
IS-DON: in-session decoding order number
PTS: presentation timestamp (equal to NTP timestamp)
AU: access unit (identified by PTS or NTP timestamp)
NALu: NAL unit

Existing solutions in the draft
• Two solutions to the problem are included in the draft.
• The 1st solution in the draft is the so-called “classical RTP decoding order recovery mode”, described in subsection 8.1.1.
• The 2nd solution in the draft is the so-called “CL-DON decoding order recovery mode”, described in subsection 8.1.2.
• Short introductions and packetization rules for both existing solutions are included in subsection 7.1.

Idea of the existing solution 1 (1)
• The 1st solution in the draft is the so-called “classical RTP decoding order recovery mode”, described in subsection 8.1.1, with an introduction and the packetization rules specified in subsection 7.1.
• The idea is summarized as follows.
  • Utilize NTP timestamps and RTP timestamps to identify all NAL units belonging to any AU.
    • NTP timestamps can be derived for each NAL unit according to RTCP sender reports and RTP timestamps.
    • Non-AU-aligned NAL units are defined as those NAL units that exist in one session while no NAL units with the same NTP timestamp exist in another session, e.g. NALu_0_2 and NALu_1_3 in slide 10 (The problem to solve (6)), and all the NAL units in slide 11 (The problem to solve (7)). Other NAL units are referred to as AU-aligned NAL units.
    • Furthermore, type I non-AU-aligned NAL units are defined as those NAL units that exist in a lower session (session 0) while no NAL units with the same NTP timestamp exist in a higher session (session 1), e.g. NALu_0_2 in slide 10. Type II non-AU-aligned NAL units are those NAL units that exist in a higher session (session 1) while no NAL units with the same NTP timestamp exist in a lower session (session 0), e.g. NALu_1_3 in slide 10.
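The definitions above can be sketched as a small classifier for the two-session case. This is an illustration only: the function name `classify`, the tuple layout, and the use of integer timestamps are assumptions, and the input data is the packet-loss example of slide 10.

```python
from collections import defaultdict


def classify(units, lower=0, higher=1):
    """Label each NAL unit as 'aligned', 'type I' or 'type II'.

    units: iterable of (name, session, ntp_timestamp) for two sessions.
    """
    units = list(units)
    # Collect the set of NTP timestamps seen in each session.
    stamps = defaultdict(set)
    for _, session, ts in units:
        stamps[session].add(ts)

    labels = {}
    for name, session, ts in units:
        other = higher if session == lower else lower
        if ts in stamps[other]:
            labels[name] = "aligned"
        elif session == lower:
            labels[name] = "type I"   # only present in the lower session
        else:
            labels[name] = "type II"  # only present in the higher session
    return labels


# Received NAL units of slide 10 (NALu_1_2 and NALu_0_3 were lost).
received = [
    ("NALu_0_0", 0, 0), ("NALu_0_1", 0, 0), ("NALu_0_2", 0, 2),
    ("NALu_1_0", 1, 0), ("NALu_1_1", 1, 0), ("NALu_1_3", 1, 1),
]
labels = classify(received)
```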

Idea of the existing solution 1 (2)
• The idea is summarized as follows (continuing the previous slide).
  • The decoding order of AU-aligned NAL units of the same AU is decided according to IS-DON (which can be derived according to RFC 3984) and session dependency (signaled per draft-ietf-mmusic-decoding-dependency-01).
    • For example, in the previous examples, session 1 depends on session 0. Therefore, for NAL units within the same AU, except for SEI NAL units, any NAL unit in session 1 shall follow every NAL unit in session 0 in decoding order. SEI NAL units shall always precede any VCL (video coding layer) NAL unit in decoding order in the same AU.
  • The RTP sender MUST avoid type I non-AU-aligned NAL units in the sent packet stream by inserting, when needed, (possibly additional) video NAL units with the same NTP timestamp into the higher session, as shown in the following slide. Therefore, ideally, received NAL units should not contain type I non-AU-aligned NAL units.
  • Type II non-AU-aligned NAL units are handled the same way as AU-aligned NAL units, as if the lower session were not present.
  • The decoding order of NAL units in different AUs is decided according to IS-DON values of the highest session.

Idea of the existing solution 1 (3)
Based on the example in slide 11 (The problem to solve (7)), two NAL units (labeled NALu below) were added to session 1 for AU_0 and AU_2. The bitstream that additionally includes the inserted NAL units must still conform to the SVC coding specification, as the inserted NAL units are also sent to the decoder. Thanks to the inserted NAL units, the decoding order of the NAL units in the lower session (i.e. NALu_0_0 and NALu_0_1) relative to the NAL units in the higher session (i.e. NALu_1_0 and NALu_1_2) can be known from the IS-DON values of the higher session.

AU            AU_0      AU_1      AU_2      AU_3
Session 1     NALu      NALu_1_0  NALu      NALu_1_2  …
  CS-DON      1         2         4         5         …
  IS-DON      0         1         2         3         …
Session 0     NALu_0_0            NALu_0_1            …
  CS-DON      0                   3                   …
  IS-DON      0                   1                   …
PTS           2         1         4         3         …

CS-DON: cross-session decoding order number (i.e. cross-layer DON or CL-DON in the draft)
IS-DON: in-session decoding order number
PTS: presentation timestamp (equal to NTP timestamp)
AU: access unit (identified by PTS or NTP timestamp)
NALu: NAL unit
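The ordering rules of solution 1 can be sketched in Python and applied to the example above. This is a simplified illustration, not the process specified in the draft: it assumes two sessions, no packet loss, and no type I non-AU-aligned NAL units. As a further simplification it places SEI NAL units ahead of all other NAL units of the same AU. The `Nal` class, the function `recover_order`, and the names given to the two inserted NAL units are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Nal:
    name: str
    session: int          # 0 = lower session, 1 = higher session
    is_don: int           # in-session decoding order number
    ntp: int              # NTP timestamp, identifies the access unit
    is_sei: bool = False


def recover_order(units, highest_session=1):
    # Group NAL units into access units by NTP timestamp.
    aus = defaultdict(list)
    for u in units:
        aus[u.ntp].append(u)

    # Access units are ordered by their IS-DON in the highest session.
    # Assumption: every AU has at least one NAL unit in that session.
    def au_key(ntp):
        return min(u.is_don for u in aus[ntp] if u.session == highest_session)

    ordered = []
    for ntp in sorted(aus, key=au_key):
        # Within an AU: SEI first, then lower session before higher, then IS-DON.
        ordered.extend(
            sorted(aus[ntp], key=lambda u: (not u.is_sei, u.session, u.is_don))
        )
    return [u.name for u in ordered]


units = [
    Nal("NALu_1_2", 1, 3, ntp=3),
    Nal("inserted_AU_2", 1, 2, ntp=4),
    Nal("NALu_0_1", 0, 1, ntp=4),
    Nal("NALu_1_0", 1, 1, ntp=1),
    Nal("inserted_AU_0", 1, 0, ntp=2),
    Nal("NALu_0_0", 0, 0, ntp=2),
]
order = recover_order(units)
```

The resulting order corresponds to CS-DON values 0 to 5 in the figure, without any CS-DON being signaled.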

Idea of the existing solution 2
• The 2nd solution in the draft is the so-called “CL-DON decoding order recovery mode”, described in subsection 8.1.2, with an introduction and the packetization rules specified in subsection 7.1.
• The idea is simply to signal the CS-DON values, such that the received NAL units are sent to the decoder in increasing order of their CS-DON values.
  • For an RTP session using the interleaved packetization mode, the DON (decoding order number) values can be derived according to RFC 3984 (the H.264 RTP payload format). The change (compared to RFC 3984) needed herein is to require that the DON values indicate CS-DON values.
  • For an RTP session using the non-interleaved packetization mode, each packet is required to contain a PACSI (payload content scalability information) NAL unit that contains a CS-DON value, which allows for derivation of CS-DON values for all NAL units in the packet.

Issues of the existing solution 1 (1)
• Solution 1 has the following constraints.
  • Con#1.1: The error resilience problem.
    • The RTP receiver has to discard some received video NAL units when, due to packet losses, type I and type II non-AU-aligned NAL units are adjacent to each other in (cross-session) decoding order.
      • An example is NALu_0_2 and NALu_1_3 in slide 10 (The problem to solve (6)). These two NAL units have to be discarded, as there is no way for the receiver using this solution to know the relative decoding order between them.
      • More NAL units may have to be discarded in complicated cases.
    • From an architecture point of view, handling (including discarding) of received video NAL units should be done by the video decoder (not the RTP receiver).
      • But this is an application issue; the RTP receiver will notice packet loss and will need, in both cases, to either re-synchronize or inform the decoder of the loss.
      • There are applications, like voice-activated video switch MCUs, that will discard RTP packets while waiting for an Intra frame.
    • The exact process for discarding NAL units (i.e. which NAL units are to be discarded to ensure correct decoding order recovery of the rest of the received NAL units) remains unclear in the draft.
    • The topic of error resilience is not discussed for either solution, and the handling of packet loss is dependent on the mechanism used.

Issues of the existing solution 1 (2)
• Solution 1 has the following constraints (continuing the previous slide).
  • Con#1.2: The NAL unit generation and insertion problem.
    • The RTP sender has to support generation and insertion of video NAL units to avoid type I non-AU-aligned NAL units.
    • The RTP sender needs to fully understand the inserted video NAL units to be able to generate them.
    • These additional NAL units may make the received bitstream non-conforming to the SVC coding specification because of conflicts in buffering (i.e. HRD, hypothetical reference decoder) parameters.
    • In addition, the process itself is underspecified in the draft, in the following aspects.
      • What type of NAL units are to be added?
      • How to set the fields or syntax elements in the RTP header, payload header, NAL unit header and NAL unit payload of each inserted NAL unit?

Issues of the existing solution 1 (3)
• Solution 1 has the following constraints (continuing the previous slide).
  • Con#1.3: The initial delay problem.
    • The process can only start after every session has received at least one RTCP sender report, which is needed to derive NTP timestamps for each AU or NAL unit.
    • This initial delay has at least two implications.
      • Buffering of packets is needed for each session even when the single NAL unit packetization mode or the non-interleaved packetization mode is in use.
      • The delay may be too long for low-delay applications such as video telephony and video conferencing.
    • However, if there is a problem then it is an RTP issue and should be solved in a non-payload-specific document.
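The dependency on RTCP sender reports can be illustrated with a sketch of the RTP-to-NTP timestamp mapping. The function name `ntp_time`, its parameters, and the representation of the sender report as a `(ntp_seconds, rtp_timestamp)` pair are assumptions made for this example; the 90 kHz clock is the RTP clock rate of the H.264 payload format.

```python
from typing import Optional, Tuple

VIDEO_CLOCK_HZ = 90_000  # RTP clock rate of the H.264 payload format


def ntp_time(rtp_ts: int,
             last_sr: Optional[Tuple[float, int]],
             clock_hz: int = VIDEO_CLOCK_HZ) -> Optional[float]:
    """Map an RTP timestamp to sender wall-clock time in seconds.

    last_sr is the (ntp_seconds, rtp_timestamp) pair of the latest RTCP
    sender report of this session, or None if none has arrived yet.
    """
    if last_sr is None:
        # Without a sender report the NAL unit cannot be assigned to an AU
        # across sessions, so the packet has to be buffered.
        return None
    sr_ntp, sr_rtp = last_sr
    # Signed 32-bit difference, so timestamps around a wrap still work.
    diff = (rtp_ts - sr_rtp) & 0xFFFFFFFF
    if diff >= 0x80000000:
        diff -= 0x100000000
    return sr_ntp + diff / clock_hz
```

For example, `ntp_time(180000, (1000.0, 90000))` lies one second after the sender report, while `ntp_time(180000, None)` returns `None` until the first sender report of the session has been received.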

Issues of the existing solution 2
• Solution 2 has the following constraints.
  • Con#2.1: It cannot be used with the single NAL unit packetization mode.
    • Can be solved by sending PACSI in a single NAL unit packet (see the next two slides)
  • Con#2.2: When it is used with the non-interleaved packetization mode, single NAL unit packets and fragmentation units of type A (FU-A) cannot be used.
    • Can be solved by sending PACSI in a single NAL unit packet (see the next two slides)
  • Con#2.3: It adds more data for the non-interleaved packetization mode in all cases, while solution 1 adds more data for the non-interleaved packetization mode only in some cases, e.g. when purely temporal scalable layers are in use.

Comments on issues of the existing solution 2
• Comments on Con#2.1
  • The need for support of the single NAL unit packetization mode needs further study.
  • It is still possible to use the non-interleaved packetization mode and to encapsulate only one video NAL unit into one packet, together with a small PACSI NAL unit containing the CS-DON information.
  • From earlier AVT discussions, nobody is aware of existing implementations of RFC 3984 that are capable of doing multicast AND use only the single NAL unit packetization mode. On the other hand, there are such implementations using unicast.
• Comments on Con#2.2
  • The need for use of single NAL unit packets needs further study, see above.
  • The need for use of FU-A packets is unclear. RFC 3984 lists only a few cases where this mode is advantageous over IP-layer fragmentation. We are unaware of any deployment.

Suggested modifications to solution 2
• This solution can be extended to allow the use of the single NAL unit packetization mode, as follows.
  • Allow for packets to contain only a PACSI NAL unit. RFC 3984 receivers will ignore PACSI NAL units (with NAL unit type 30) according to Table 3 of RFC 3984.
  • For the NAL units belonging to one access unit in the RTP session, the packet carrying the first NAL unit in transmission order MUST be preceded by a packet containing only a PACSI NAL unit; the rest MAY be preceded by a packet containing only a PACSI NAL unit.
  • The CS-DON value indicated in the PACSI NAL unit applies to the subsequent NAL unit in transmission order. For each following NAL unit in transmission order until the next PACSI NAL unit, the CS-DON value is equal to the CS-DON value of the previous NAL unit in transmission order incremented by 1.
• Similar to the existing solution 1, this extension has an error resilience problem. When a packet containing only a PACSI NAL unit immediately precedes the first NAL unit of an access unit in the RTP session and is lost, the CS-DON value of the NAL unit in the subsequent packet in transmission order cannot be derived. As with the existing solution 1, the RTP receiver has to discard those NAL units for which the CS-DON values cannot be derived.
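The derivation rule of this extension, including the loss case, can be sketched as follows. This is a minimal illustration under stated assumptions: the `Packet` class and `derive_cs_don` are hypothetical, sequence number and CS-DON wrap-around are ignored, and the receiver conservatively treats every sequence number gap as a possibly lost PACSI packet.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Packet:
    seq: int                            # RTP sequence number (no wrap handling)
    pacsi_cs_don: Optional[int] = None  # set only for PACSI-only packets
    nal: Optional[str] = None           # label of the carried video NAL unit


def derive_cs_don(packets: List[Packet]) -> List[Tuple[str, Optional[int]]]:
    """Return (nal, cs_don) pairs; cs_don is None when it cannot be derived."""
    result = []
    next_don = None   # CS-DON for the next video NAL unit, if known
    prev_seq = None
    for p in packets:
        if prev_seq is not None and p.seq != prev_seq + 1:
            next_don = None          # a packet is missing: CS-DON unknown
        prev_seq = p.seq
        if p.pacsi_cs_don is not None:
            next_don = p.pacsi_cs_don  # applies to the subsequent NAL unit
            continue
        result.append((p.nal, next_don))
        if next_don is not None:
            next_don += 1            # following NAL units increment by 1
    return result


# The PACSI-only packet with sequence number 4 was lost in transmission.
received = [
    Packet(1, pacsi_cs_don=4), Packet(2, nal="a"), Packet(3, nal="b"),
    Packet(5, nal="c"),
    Packet(6, pacsi_cs_don=12), Packet(7, nal="d"),
]
derived = derive_cs_don(received)
```

Here NAL unit "c" gets no CS-DON and would have to be discarded, while "d" is recovered from the next PACSI NAL unit.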

Other possible solutions
• A generic solution
  • Signaling of CS-DON values, or equivalent information that enables the derivation of CS-DON values, in an RTP header extension.
  • This solution, if viable, would be suitable for all current and future scalable media codecs.
  • However, this was basically concluded as not viable, because since RFC 1889 there has been a requirement that an RTP header extension can be ignored without affecting correctness.
• Combination of the two existing solutions for the non-interleaved packetization mode
  • Comb#1: Receivers are required to support the 1st solution even when PACSI NAL units carrying CS-DON information are present, meaning that insertion of video NAL units to avoid type I non-AU-aligned NAL units in the sent NAL unit stream is always needed, though that is not needed by the 2nd solution.
  • Comb#2: An alternative is to mandate both inclusion of PACSI NAL units carrying CS-DON information and insertion of video NAL units to avoid type I non-AU-aligned NAL units in the sent NAL unit stream, such that the receiver can freely choose which solution to use.
  • However, there is no detailed specification text available for either of the above, and it is unclear whether either way will work for all cases. For example, is it allowed to have a packet containing only a PACSI NAL unit? Is it allowed to have a single NAL unit packet? How to ensure that CS-DON values can be derived for all NAL units, including the inserted NAL units?

Possible ways to go
• There are at least the following alternative ways to go:
  • Take only the existing 1st solution
  • Take only the existing 2nd solution
  • Take only the existing 2nd solution plus the extension in slide 22 (Suggested modifications to solution 2)
  • Take a combined solution as follows
    • Use the existing 1st solution for the single NAL unit packetization mode
    • Use Comb#1 in slide 24 (Other possible solutions) for the non-interleaved packetization mode (currently unclear whether it always works)
    • Use the existing 2nd solution for the interleaved packetization mode
  • Take a combined solution as follows
    • Use the existing 1st solution for the single NAL unit packetization mode
    • Use Comb#2 in slide 24 (Other possible solutions) for the non-interleaved packetization mode (currently unclear whether it always works)
    • Use the existing 2nd solution for the interleaved packetization mode
  • Take both existing solutions and let the use be negotiated
