
A near real time decoding for LDPC based distributed video coding using CUDA


Presentation Transcript


  1. A near real time decoding for LDPC based distributed video coding using CUDA (Chinese title, translated: a near real-time decoding design for LDPC-code-based distributed coding under the CUDA architecture). Su, Tse-Chung (蘇則仲). Advisor: Prof. Wu, Ja-Ling (吳家麟). 2011/6/9. CMLab, CSIE, NTU

  2. Outline: Motivation and Introduction; LDPC decoding & LDPCA in DVC; Parallel LDPCA Decoding in CUDA; Early Stop Detection Mechanism Using CUDA; Evaluation of Decoding Speed; Conclusions and Future Work. CMLab, CSIE, NTU

  3. Conventional Video Codec: heavyweight ENCODER, lightweight DECODER (e.g., MPEG-2, H.264, HEVC/H.265). CMLab, CSIE, NTU

  4. Distributed Video Coding (DVC): a new paradigm for video compression with a lightweight ENCODER and a heavyweight DECODER. CMLab, CSIE, NTU

  5. Application of DVC: video conferencing with mobile devices. The mobile device runs a low-complexity DVC encoder; the DVC-encoded bitstream is sent to a cloud computational resource hosting a real-time DVC-to-H.264 transcoder, and the resulting H.264-encoded bitstream is played back by a low-complexity H.264 decoder. CMLab, CSIE, NTU

  6. Distributed Video Coding: [diagram: the DVC encoder/decoder pair built around an LDPC (channel) encoder and an LDPC (channel) decoder.] D. Varodayan, A. Aaron, and B. Girod, "Rate-Adaptive Codes for Distributed Source Coding," EURASIP Signal Processing Journal, Special Issue on Distributed Source Coding, November 2006.

  7. Decoding Complexity of DVC: the DECODER is heavyweight. Our DVC codec (state-of-the-art), parallelized with OpenMP and CUDA on a 12-core CPU plus a GPGPU (Fermi), still reaches only about 1 FPS. CMLab, CSIE, NTU

  8. Amdahl's law: the maximum speedup is reached by improving the most critical part of the system, which here is LDPC decoding in the DVC decoder. [Chart: decoding-time breakdown for QCIF sequences; LDPC decoding accounts for roughly 86%~94% of the time; other figures on the slide: 29%~36%, 15.39 fps.] CMLab, CSIE, NTU
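  For reference, Amdahl's law in its standard form, with an illustrative value p = 0.90 chosen inside the 86%~94% range quoted above (the exact figure used on the slide is not restated here):

  \[ S_{\text{overall}} = \frac{1}{(1-p) + p/s} \qquad\Longrightarrow\qquad \lim_{s\to\infty} S_{\text{overall}} = \frac{1}{1-p} = \frac{1}{1-0.90} = 10 \]

  Even an unboundedly fast LDPC decoder therefore caps the whole-decoder speedup at roughly 7x to 17x for p between 0.86 and 0.94, which is why the LDPCA stage is the target of the rest of this work.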

  9. Outline: Motivation and Introduction; LDPC decoding & LDPCA in DVC; Parallel LDPCA Decoding in CUDA; Early Stop Detection Mechanism Using CUDA; Evaluation of Decoding Speed; Conclusions and Future Work. CMLab, CSIE, NTU

  10. LDPC decoding: Sum-Product Algorithm (message passing). [Tanner-graph figure: variable nodes a-g are initialized from the side information (real-valued, + maps to 0, - maps to 1); the three check nodes receive the syndrome bits 0 1 1 from the DVC encoder; horizontal processing (check-node update) and vertical processing (variable-node update) alternate, and a hard decision on the beliefs gives the decoded output.] Kschischang, F. R., Frey, B. J., and Loeliger, H.-A. 2001. Factor graphs and the sum-product algorithm. IEEE Trans. Inform. Theory.

  11. Sum-Product Algorithm, Vertical Processing: [figure: variable-node (column-wise) message update example on the graph of slide 10.] CMLab, CSIE, NTU

  12. Sum-Product Algorithm, Horizontal Processing: [figure: check-node (row-wise) message update example, incorporating the syndrome bits 0 1 1.] CMLab, CSIE, NTU
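  For reference, the two half-iterations pictured on slides 10-12 can be written in the common log-domain form (this is the textbook sum-product update with the syndrome bit folded into the check-node sign; the symbol names are generic, not taken from the slides):

  \[ \text{horizontal (check node } i\text{):}\quad m_{i\to j} = (-1)^{s_i} \prod_{k\in N(i)\setminus\{j\}} \operatorname{sign}(m_{k\to i}) \;\cdot\; \phi\!\Bigl(\sum_{k\in N(i)\setminus\{j\}} \phi\bigl(|m_{k\to i}|\bigr)\Bigr), \qquad \phi(x) = -\ln\tanh\tfrac{x}{2} \]

  \[ \text{vertical (variable node } j\text{):}\quad m_{j\to i} = L_j + \sum_{i'\in N(j)\setminus\{i\}} m_{i'\to j}, \qquad \hat{x}_j = \begin{cases} 0, & L_j + \sum_{i'\in N(j)} m_{i'\to j} \ge 0,\\ 1, & \text{otherwise,}\end{cases} \]

  where L_j is the side-information LLR of variable node j (positive means 0, negative means 1, as on slide 10) and s_i is the syndrome bit attached to check node i.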

  13. LDPC Accumulate (LDPCA) codes: rate adaptivity, i.e. the decoder requests additional accumulated syndrome bits until decoding succeeds. D. Varodayan et al., "Rate-adaptive codes for distributed source coding," EURASIP Signal Processing Journal, Special Issue on Distributed Source Coding, November 2006.

  14. [Figure: the rate-adaptive family consists of 65 LDPC codes; the column degree is 3.]

  15. Outline: Motivation and Introduction; LDPC decoding & LDPCA in DVC; Parallel LDPCA Decoding in CUDA (Kernel Design); Early Stop Detection Mechanism Using CUDA; Evaluation of Decoding Speed; Conclusions and Future Work. CMLab, CSIE, NTU

  16. Previous CUDA implementation. Vertical Processing Kernel (VPK): the column degree is a constant 3 (regular LDPC); messages are staged in shared memory. Horizontal Processing Kernel (HPK): each message can be updated by one thread (SIMD); the row degree varies within each LDPC code; data structure: circular linked list. [Figure: message data and index data laid out in global memory and gathered into per-block shared memory.] Pai, Y.-S., Cheng, H.-P., Shen, Y.-C. and Wu, J.-L. 2010. Fast decoding for LDPC based distributed video coding. In Proc. of ACM International Conference on Multimedia.

  17. CUDA Implementation Strategy 1: Reduction of Φ Function in HPK; Texture Memory in VPK.

  18. Texture Binding in VPK: the VPK's scattered (non-coalescing) reads from global memory are served through a bound texture instead, giving a speedup on both 1.x and 2.x compute capability devices. [Figure: global-memory access pattern with and without texture binding.] CMLab, CSIE, NTU
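  A minimal sketch of what texture binding in the VPK might look like. The kernel, array names, and the simplification of summing all three incoming messages (rather than excluding the target edge) are ours, not the thesis code; the legacy texture-reference API is used because it matches the CUDA 3.2 era of this work.

  ```cuda
  // Hedged sketch, not the thesis code: a 1-D texture serves the scattered
  // per-edge reads that would otherwise be non-coalesced global loads.
  texture<float, 1, cudaReadModeElementType> texMsg;   // bound to the message array

  __global__ void vpkTextureSketch(const int* colIdx, const float* llr,
                                   float* out, int numVar)
  {
      int v = blockIdx.x * blockDim.x + threadIdx.x;
      if (v >= numVar) return;

      // Column degree is a constant 3 in these LDPCA codes (slide 16).
      float sum = llr[v];
      for (int d = 0; d < 3; ++d) {
          // tex1Dfetch() goes through the texture cache, so this irregular
          // access pattern no longer has to be coalesced.
          sum += tex1Dfetch(texMsg, colIdx[3 * v + d]);
      }
      out[v] = sum;   // simplified: the per-edge exclusion step is omitted here
  }

  // Host side (once per decode call):
  //   cudaBindTexture(0, texMsg, d_messages, numMsg * sizeof(float));
  ```

  Because tex1Dfetch() is served by the texture cache, the scattered reads stop depending on coalescing, which is where the reported speedup on both 1.x and 2.x devices comes from.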

  19. [Chart: LDPCA decoding time in the previous CUDA implementation.] CMLab, CSIE, NTU

  20. Reduction of Φ Function in HPK: [figure: each thread of a block copies its message from global memory into shared memory according to the index data.] CMLab, CSIE, NTU

  21. Reduction of Φ Function in HPK: the φ values are calculated before copying to shared memory, so the number of φ(x) evaluations per row drops to the row degree. [Figure: threads t0-t7 of block 0 writing precomputed φ values into shared memory.] CMLab, CSIE, NTU
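  A minimal sketch of this optimization with illustrative names (phi, s_phi, rowIdx are ours): each thread evaluates φ once on its own message while staging it into shared memory, so the subsequent row reduction only sums already-transformed values.

  ```cuda
  // Hedged sketch of the "apply phi before staging to shared memory" step.
  __device__ __forceinline__ float phi(float x)
  {
      // phi(x) = -ln(tanh(x/2)); clamp to avoid log(0) at x = 0.
      x = fmaxf(x, 1e-10f);
      return -logf(tanhf(0.5f * x));
  }

  __global__ void hpkPhiPrecomputeSketch(const float* msg, const int* rowIdx,
                                         float* outMag, int numEdges)
  {
      extern __shared__ float s_phi[];           // one slot per edge handled by this block
      int t = blockIdx.x * blockDim.x + threadIdx.x;
      if (t >= numEdges) return;

      float m = msg[rowIdx[t]];
      s_phi[threadIdx.x] = phi(fabsf(m));        // phi evaluated exactly once per message
      __syncthreads();

      // The row-wise reduction over s_phi[] and one final phi() per outgoing
      // message follow here (slides 24-26); only the precompute step is shown.
      outMag[t] = s_phi[threadIdx.x];
  }
  ```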

  22. LDPCA Performance, foreman sequence (QCIF): [table: step speedup, cumulative speedup, and LDPCA time after this optimization.]

  23. CUDA Implementation Strategy 2: Parallel Partial Reduction in HPK.

  24. Parallel Reduction: [figure: tree reduction in shared memory with strides 8, 4, 2, 1; threads 0-7 are active in step 1, then 0-3, 0-1, and finally thread 0.] Sequential addressing is conflict free. Optimizing Parallel Reduction in CUDA, Mark Harris, NVIDIA Developer Technology. CMLab, CSIE, NTU
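  A sketch of the sequential-addressing step from Harris's reduction, written as a device helper (names are illustrative); this is the pattern the HPK's partial reduction follows.

  ```cuda
  // Hedged sketch of sequential-addressing reduction over one row of phi values.
  // All threads of the block must call this so that __syncthreads() is reached
  // uniformly; s_val[0] ends up holding the sum of the rowDeg entries.
  __device__ void rowReduce(volatile float* s_val, int tid, int rowDeg)
  {
      for (int stride = rowDeg >> 1; stride > 0; stride >>= 1) {
          if (tid < stride)
              s_val[tid] += s_val[tid + stride];   // thread tid adds its partner
          __syncthreads();
      }
  }
  ```

  Sequential addressing means thread t always touches s_val[t] and s_val[t + stride], so active threads hit consecutive shared-memory banks and no bank conflicts occur.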

  25. Computation Overlapping in HPK: [figure: threads t0-t4 perform a parallel partial reduction of the message magnitudes.] CMLab, CSIE, NTU

  26. Parallel Partial Reduction: [figure: two rows with rowDeg = 8 and rowDeg = 4 are reduced in shared memory; the first row needs log2(rowDeg) = 3 steps; threads beyond the active stride sit idle; the results are the row magnitudes Mag0 and Mag1.] CMLab, CSIE, NTU

  27. CUDA Implementation Strategy 3: Check Node Re-ordering; Complete Unrolling.

  28. Check Node Re-ordering: [figure: check nodes are grouped by row degree, so each CUDA thread block processes check nodes of a single degree (e.g., one block for the rowDeg = 8 rows and another for the rowDeg = 4 rows) over the same variable nodes 0-6.] CMLab, CSIE, NTU
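  One way to realize the re-ordering on the host, sketched under the assumption that the row degree of every check node is known; the function and variable names are ours.

  ```cuda
  // Hedged host-side sketch: group check nodes by row degree before uploading,
  // so every CUDA block processes rows of a single degree and the unrolled
  // kernel on the next slide never diverges on rowDeg.
  #include <algorithm>
  #include <vector>

  std::vector<int> reorderChecksByDegree(const std::vector<int>& rowDeg)
  {
      std::vector<int> order(rowDeg.size());
      for (size_t i = 0; i < order.size(); ++i) order[i] = static_cast<int>(i);
      std::stable_sort(order.begin(), order.end(),
                       [&](int a, int b) { return rowDeg[a] < rowDeg[b]; });
      return order;   // launch order: all degree-4 rows first, then degree-8, ...
  }
  ```

  With the check nodes sorted this way, every thread block sees one rowDeg value, which is exactly what the complete unrolling on the next slide relies on.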

  29. Complete Unrolling. The generic loop version contains a redundant if/else and a __syncthreads() in every step, and its branch divergence harms performance:

  ```cuda
  int i = threadIdx.x;
  int half = rowDeg >> 1;              // rowPos: this thread's position within its row
  float myMag  = s_mag[i];
  char  mySign = s_sign[i];
  do {
      if (rowPos < half) {             // branch divergence inside the warp
          s_mag[i]  += s_mag[i + half];
          s_sign[i] ^= s_sign[i + half];
      }
      half >>= 1;
      __syncthreads();
  } while (half);
  int base = i - rowPos;
  myMag  = s_mag[base] - myMag;
  mySign = s_sign[base] ^ mySign;
  ```

  Completely unrolling the reduction for each possible row degree removes the loop, the redundant __syncthreads(), and the branch divergence:

  ```cuda
  int i = threadIdx.x;
  float myMag  = s_mag[i];
  char  mySign = s_sign[i];
  if (rowDeg == 16) {                  // no branch divergence within a row
      s_mag[i] += s_mag[i + 8];  s_sign[i] ^= s_sign[i + 8];
      s_mag[i] += s_mag[i + 4];  s_sign[i] ^= s_sign[i + 4];
      s_mag[i] += s_mag[i + 2];  s_sign[i] ^= s_sign[i + 2];
      s_mag[i] += s_mag[i + 1];  s_sign[i] ^= s_sign[i + 1];
  } else if (rowDeg == 8) {
      ...
  }
  int base = i - rowPos;
  myMag  = s_mag[base] - myMag;
  mySign = s_sign[base] ^ mySign;
  ```

  Optimizing Parallel Reduction in CUDA, Mark Harris, NVIDIA Developer Technology.

  30. CUDA Implementation Strategy 4: Combination of VPK and HPK.

  31. Kernel Launch Overhead: launching VPK and HPK as separate kernels in every iteration is costly because (1) parallelism is broken by the implicit inter-block synchronization at each kernel boundary and (2) it causes extra global-memory traffic. The two kernels are therefore merged: VPK + HPK = UMK. NVIDIA CUDA Programming Guide (3.2), Section 5.2.1. CMLab, CSIE, NTU

  32. LDPCA Performance, foreman sequence (QCIF): [table: step speedup, cumulative speedup, and time after combining the kernels.]

  33. Outline: Motivation and Introduction; LDPC decoding & LDPCA in DVC; Parallel LDPCA Decoding in CUDA; Early Stop Detection Mechanism Using CUDA (CUDA API); Evaluation of Decoding Speed; Conclusions and Future Work. CMLab, CSIE, NTU

  34. CUDA Implementation Strategy 5: Early Stop Detection.

  35. Early Stop Detection in the Sum-Product Algorithm. On the CPU, every SPA iteration (horizontal plus vertical processing) is followed by an early stop detection step, so decoding can terminate, for example, at iteration 30 (either successfully decoded or converged to a wrong codeword) instead of always running the full 100 iterations. On the GPU, the flow becomes: run UMK for each iteration, run the early stop detection kernel (EDK), transfer the codeword and decoded info over PCI-E, and let the CPU check the decode info. CMLab, CSIE, NTU
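  A minimal sketch of what an early-stop (syndrome) check can look like as a CUDA kernel; the CSR-style index arrays and the d_fail flag are illustrative assumptions, not the thesis EDK.

  ```cuda
  // Hedged sketch: hard-decide the current beliefs elsewhere, then test every
  // parity check against the received syndrome. If *d_fail stays 0, the
  // codeword has converged and the iteration loop can stop.
  __global__ void earlyStopSketch(const char* hardBits,   // 0/1 decision per variable node
                                  const int*  rowIdx,     // variable indices of each check's edges
                                  const int*  rowPtr,     // CSR-style row offsets
                                  const char* syndrome,   // target syndrome bit per check
                                  int numChecks, int* d_fail)
  {
      int c = blockIdx.x * blockDim.x + threadIdx.x;
      if (c >= numChecks) return;

      char parity = 0;
      for (int e = rowPtr[c]; e < rowPtr[c + 1]; ++e)
          parity ^= hardBits[rowIdx[e]];

      if (parity != syndrome[c])
          atomicExch(d_fail, 1);     // any mismatch flags the frame as not yet decoded
  }
  ```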

  36. Combination of EDK and UMK. The SPA is memory intensive in CUDA, and the index data used by UMK is also needed by early stop detection (EDK), so the two are merged: EDK + UMK = EDUMK, at about 14% additional complexity in terms of execution time. CMLab, CSIE, NTU

  37. Concurrent Kernel Execution and Data Transfer. [Ideal timeline: while the GPU runs UMK for iterations 2-4, the early stop detection result of iteration 1 is transferred over PCI-E and checked on the CPU; the same pattern repeats for iterations 5, 9, and so on.] CMLab, CSIE, NTU

  38. Practical CUDA Implementation of Early Stop Detection. [Timeline: three streams with #overlap = 3; stream 0 runs EDUMK, stream 1 runs the remaining UMKs, stream 2 handles the PCI-E transfers; the host performs explicit synchronization.] One CPU thread and one GPU are used, with the CUDA Driver API instead of the Runtime API, so almost no stream-management calls (cudaStreamSynchronize(), cudaStreamQuery(), or cudaStreamWaitEvent()) are needed. CMLab, CSIE, NTU
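  A hedged Runtime-API illustration of the overlap pattern on this slide (the thesis itself uses the Driver API and reports needing almost no stream-management calls); edumkKernel, umkKernel, d_info, h_info, infoBytes, grid, block, overlap, maxIter, and hostEarlyStopCheck() are all placeholders.

  ```cuda
  // Hedged sketch: one compute stream for the kernels, one copy stream for the
  // PCI-E transfer, so the transfer of iteration i's decode info overlaps the
  // UMK launches for iterations i+1 .. i+overlap-1.
  cudaStream_t sCompute, sCopy;
  cudaEvent_t  edumkDone;
  cudaStreamCreate(&sCompute);
  cudaStreamCreate(&sCopy);
  cudaEventCreate(&edumkDone);

  int* h_info = nullptr;
  cudaMallocHost(&h_info, infoBytes);               // pinned memory enables async copy

  for (int iter = 0; iter < maxIter; iter += overlap) {
      // EDUMK produces both the updated messages and the decode info.
      edumkKernel<<<grid, block, 0, sCompute>>>(/* ... */);
      cudaEventRecord(edumkDone, sCompute);

      // The next (overlap - 1) iterations run as plain UMKs on the same stream.
      for (int k = 1; k < overlap; ++k)
          umkKernel<<<grid, block, 0, sCompute>>>(/* ... */);

      // The PCI-E transfer of the decode info overlaps those UMK launches.
      cudaStreamWaitEvent(sCopy, edumkDone, 0);
      cudaMemcpyAsync(h_info, d_info, infoBytes, cudaMemcpyDeviceToHost, sCopy);

      cudaStreamSynchronize(sCopy);                 // explicit host synchronization
      if (hostEarlyStopCheck(h_info)) break;        // stop once the check passes
  }
  ```

  The #overlap = 3 on the slide corresponds to checking only every third iteration, which presumably trades a little extra work after convergence for far fewer PCI-E round trips.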

  39. Speed-up Ratio of Early Stop Detection. [Chart: theoretical vs. actual speedup of early stop detection over fixed-iteration decoding as a function of the total number of LDPCA iterations (10000, 20000); roughly 2.0x theoretical and 1.8x actual; additional bars give the detection overhead on the GPU with the Runtime API, on the GPU with the Driver API, and on the CPU (figures of 5%, 7%, 10%, and 20% appear on the slide).] CMLab, CSIE, NTU

  40. LDPCA Performance, foreman sequence (QCIF): [table: step speedup, cumulative speedup, and time with all strategies combined.] Overall, 449.63x faster than the sequential program!

  41. Outline: Motivation and Introduction; LDPC decoding & LDPCA in DVC; Parallel LDPCA Decoding in CUDA; Early Stop Detection Mechanism Using CUDA; Evaluation of Decoding Speed; Conclusions and Future Work. CMLab, CSIE, NTU

  42. Test condition • CPU: Intel(R) Xeon(R) X5650 @ 2.67 GHz, 12 cores / 24 logical processors • GPU: Tesla M2050, 14 (MP) x 32 (cores/MP) = 448 cores • CUDA compute capability 2.0 • Shared memory: 48 KB • Maximum threads per block: 1024 • Concurrent copy and execution • Concurrent kernel execution

  43. Test condition. Test sequences (from high to low motion): Soccer, Foreman, Coastguard, Hall Monitor; QCIF, 15 Hz, all frames. GOP size: 8. Qindex: 8. Bitrate and PSNR are measured on the luminance component only. CMLab, CSIE, NTU

  44. Speedup Ratio of the LDPCA Decoder Using CUDA. [Chart, per test sequence: LDPCA decoding speedups of 12.88x, 15.35x, 22.51x, and 36.91x; whole-decoder speedups of 6.32x, 7.43x, 9.8x, and 13.5x; frame rates rise from 0.79-1.14 fps to 4.99-15.39 fps, at the cost of a 0.2% bit-rate increase.]

  45. LDPCA decoding time comparison. Ryanggeun, O., Jongbin, P. and Byeungwoo, J. 2010. Fast implementation of Wyner-Ziv video codec using GPGPU. In Proc. of IEEE International Symposium on Broadband Multimedia Systems and Broadcasting, 1-5.

  46. Real-time Decoding Quality. [Figure: decoded frames shown beside the original sequences at 27.44 dB / 76 kbps, 29.21 dB / 93.17 kbps, 35.34 dB / 263.52 kbps, and 39.46 dB / 147.64 kbps.]

  47. Conclusion. A fully parallelized LDPCA decoder was built in CUDA using the strategies above. The proposed early stop detection mechanism reduces the latency between the CPU and the GPU. Surveillance-like sequences (e.g., Hall Monitor) can be decoded in real time with negligible RD performance loss. CMLab, CSIE, NTU

  48. Future Work: bitplane-level parallelization for LDPCA (soft input), UV components, and frame-level parallelization. [Figure: three bitplanes a1-g1, a2-g2, a3-g3 decoded over the same factor graph with shared horizontal and vertical processing and a common syndrome structure (Vitor Silva).] CMLab, CSIE, NTU

  49. Thank You CMLab, CSIE, NTU
