Download
slide1 n.
Skip this Video
Loading SlideShow in 5 Seconds..
CS315A March 9, 2009 Michael Spradlin Rhyland Klein Jong Su Lee PowerPoint Presentation
Download Presentation
CS315A March 9, 2009 Michael Spradlin Rhyland Klein Jong Su Lee

CS315A March 9, 2009 Michael Spradlin Rhyland Klein Jong Su Lee

151 Views Download Presentation
Download Presentation

CS315A March 9, 2009 Michael Spradlin Rhyland Klein Jong Su Lee

- - - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript

  1. Parallelizing MPEG-2 with CUDA CS315A March 9, 2009 Michael Spradlin Rhyland Klein Jong Su Lee Pradeep Joginipally

  2. Outline • Motivation • Why use CUDA? • Why use MPEG-2? • MPEG-2 overview • CUDA overview • CUDA + video encoder implementation • Preliminary results, looks promising

  3. Motivation • Video encoding is becoming a very resource intensive application for today’s desktop user • To encode a video is still a time consuming prospect • Advent of GPUs and CUDA technology • Video encoder is a highly parallel application • Make use of highly parallel GPU resources • Found very little work done to port video encoders onto CUDA platform • Check out how CUDA platform performs for encoders

  4. MPEG-2? • We chose MPEG-2 over MPEG-4/H.264 • Initially going to use ffmpeg for H.264 encoding • Difficulties in incorporating CUDA library with ffmpeg • None of us have experience working with CUDA or encoders • Just wanted to get started, not bogged down with setting up a build environment • MPEG-2 still relevant (DVD encoding) • Chose mpeg2encoder reference library

  5. MPEG-2 Overview: Frame Types • GOP (Group Of Picture) • I frame : Intra frame • Independently decodable. • P frame : Predictive frame • Refer to previous I and P frame • B frame : Bi-predictive Inter frame • Refer to both previous and later frames

  6. Sample Input • Elephant’s Dream, open source animated short • Used by PARSEC h264 benchmark • 15,691 .png images, converted to .yuv format

  7. How to compress and encode • Human eye has limited ability to fully perceive • Reduces or completely discards information in certain frequencies and areas • Color Space Y’CbCr • Y’ : Luma • Cb : Chroma Blue • Cr : Chroma Red • Human eye is much less sensitive to small changes in color than in brightness • 4 Y’ 1Cb 1Cr

  8. Detail Encoding Structure Input Video read parm_file read quant_matrix init rate control init _seq put seqhdr Loop through all frames in encoding order frame reordering rate control Init_GOP put GOP header read frame frame/ field motion estimation (1) predict (6) dct_type estimation (7) transform (2) sub_pred fdct put picture (3)

  9. Detail Encoding Structure (‘Cont) rate control int_pict put pict_hdr prev_ mquant Loop macroblock put slice hdr determine mquant MB quantiz macro_block mode put motion vector block decode rate control update pic inverse quantization (5) itransform (4) idct add pred calc_snr store frame putseqend Result

  10. Main Functions Motion Estimation (75.3 %) Calculate Prediction (0.9 %) DCT Type Estimation (0.006 %) Subtract Prediction from Picture and Perform DCT (14.7 %) Quantize DCT coefficients and Generate VLC data (5.4 %) Inverse Quantize DCT coefficients (0.9 %) Perform IDCT and Add Prediction (1.5 %) DCT : Discrete Cosine TransformVLC : Video Lan Client

  11. Unit Hierarchy Hierarchy • Sequence layer • GOP layer • Picture layer (40x24) • Slices (Frame) • Macroblock (16x16) • Block (8x8 pixels)

  12. Motion Estimation & Predict

  13. CUDA • Nvidia Architecture • GPU massively parallel • Flexible

  14. CUDA cont… • Exposes API to GPU • Many Stream processors • More compute, less cache • SIMT • Conditions expensive!

  15. CUDA cont… • Data Decomposition • Threads • Blocks • Grids

  16. CUDA cont… • Memory complex • Device Memory • Shared Memory • Local Memory • Host  Device expensive

  17. Implementation • Encoders can be parallelized at several concurrency granularities • Per GOP data • Per frame data • Per logical computational task • CUDA is vastly task concurrency focused • SIMT nature leads to implementation based on individual kernels of computation tasks

  18. Example: Inverse Quantization dst[0] = src[0] << (3-dc_prec); for (i=1; i<64; i++) { val = src[i] * quant_mat[i] * mquant / 16; /* mismatch control */ if ((val&1)==0 && val!=0) val += (val>0) ? -1 : 1; /* saturation */ dst[i] = (val>2047) ? 2047 : ((val<-2048) ? -2048 : val); } • P frames need to be un-quantized to serve as residual references for future frames in motion estimation. • Done for 960 macroblocks in a frame, for 6 blocks per macroblock, for 64 pels (pixels) per block. • 368,640 iterations per frame • Simplified inverse quantization kernel code:

  19. Step 1: CUDA Data Distribution • 960 Macroblocks • 940 Blocks in a 1-D grid • Macroblock ref is z coord • 6 Blocks x 64 Pels (pixels) • 384 2-D threads • Block ref is y coord • Pel ref is x coord

  20. Calling Into the GPU Device dim3 dimGrid(mb_per_frame); dim3 dimBlock(block_per_mb, pel_per_block); iquant1_device<<<dimGrid,dimBlock>>>(...);

  21. Step 2: CUDA-ize the Kernel • The Good • val = src[i] * quant_mat[i] * mquant / 16 • The Bad • if ((val&1)==0 && val!=0) • dst[0] = src[0] << (3-dc_prec) • The Ugly • val += (val>0) ? -1 : 1 • dst[i] = (val>2047) ? 2047 : ((val<-2048) ? -2048 : val);

  22. CUDA-ize: The Good • Copy all blocks, the quantization matrices, and the metadata for all macroblocks to GPU device memory • i = threadIdx.x (pel) • j = threadIdx.y (block) • k = blockIdx.x (macroblock) Sequential: val = block[i] * quant_mat[i] * mquant / 16 CUDA: val = blocks_d[k*block_count+j][i] * quant_mat_d[i] * mbinfo_d[k].mquant) >> 4

  23. CUDA-ize: The Bad • Branches waste time by causing redundant computation on the same data: • if ((val&1)==0 && val!=0) • Special handling of certain elements leads to conditionals: dst[0] = src[0] << (3-dc_prec); if (i == 0){ blocks_d[k*block_count+j][0] = ...

  24. CUDA-ize: The Ugly • Some conditional branches can be eliminated by use of max and min, which CUDA handles built-in in 4 cycles: val += (val>0) ? -1 : 1 dst[i] = (val>2047) ? 2047 : ((val<-2048) ? -2048 : val); incr = max(val,-1); incr = min(incr,1); val -= incr; val = min(val,2047); blocks_d[k*block_count+j][i] = max(val,-2048);

  25. Inverse Quantization Kernel Results • Input: • First 1000 frames of Elephant’s Dream • 640x360, 30 frames/sec, 144 kbps, NTSB • 4:2:0 chroma format, 4 frames per GOP, IBPB • 57x – 61x improvement!!! • Over a sequential version on a 2.33GHz Xeon • For the inverse quantization kernel

  26. But That’s a Lie! • This didn’t take the time to copy data in and out of GPU device memory into account. • 2-3x slower in the quantization kernel • 00.5% slower total encoder program • Luckily, terrible CUDA memory access time can be ameliorated over all computational kernels. cudaMemcpy(mbinfo_d, mbinfo, ..., cudaMemcpyHostToDevice); cudaMemcpy(blocks_d, blocks, ..., cudaMemcpyHostToDevice); gettimeofday(&start_time,NULL); iquant1_device<<<dimGrid,dimBlock>>>(...); gettimeofday(&end_time, NULL); quant_time += ... cudaMemcpy(blocks, blocks_d, ..., cudaMemcpyDeviceToHost);

  27. Final Thoughts • CUDA: The not so general GPGPU • No function pointers, no recursion • Very limited registers and cache space • The mystery of OpenMP, the attention to detail of Pthreads • Optimization requires intimate understanding of grid and block structures • We haven’t even gotten into memory usage tweaking • Still, remarkable results attainable with correctly structured program and enough effort