JPEG-GPU: a GPGPU Implementation of JPEG Core Coding Systems


Presentation Transcript


  1. JPEG-GPU: a GPGPU Implementation of JPEG Core Coding Systems • Ang Li, University of Wisconsin-Madison

  2. Outline • Brief Introduction of Background • Implementation • Evaluation • Conclusion NVIDIA GTC 2013

  3. Background • JPEG Encoding • Parallelism Seeking • Pre-processing: Color Conversion • Block Encoding/Decoding

  4. Implementation • Step 1 – Find target functions • Encode: encode_mcu_huff, encode_one_block, emit_bits_s • Decode: decode_mcu_DC_first, decode_mcu_DC_refine • Profiling with GPROF to find other functions • Encode: rgb_ycc_convert • Decode: ycc_rgb_convert • Each takes a little under half of the total encoding/decoding execution time
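To illustrate why the color-conversion step is a natural target, here is a minimal sketch of a per-pixel RGB to YCbCr conversion with an OpenMP loop. It is not the library's rgb_ycc_convert: the function name rgb_to_ycc, the helper clamp_u8, the interleaved buffer layout, and the floating-point arithmetic are assumptions made for illustration. The coefficients are the standard JFIF ones.

```c
#include <stddef.h>
#include <stdint.h>

/* Round and clamp a floating-point sample to the 0..255 range. */
static uint8_t clamp_u8(double v)
{
    if (v < 0.0) return 0;
    if (v > 255.0) return 255;
    return (uint8_t)(v + 0.5);
}

/* Hypothetical helper: convert n interleaved RGB pixels to interleaved
 * YCbCr using the JFIF coefficients. Each pixel is independent of every
 * other pixel, so the loop has no cross-iteration dependency. */
void rgb_to_ycc(const uint8_t *rgb, uint8_t *ycc, size_t n)
{
    /* The pragma is ignored when the compiler is run without OpenMP. */
    #pragma omp parallel for
    for (long i = 0; i < (long)n; i++) {
        double r = rgb[3 * i], g = rgb[3 * i + 1], b = rgb[3 * i + 2];
        ycc[3 * i]     = clamp_u8( 0.299    * r + 0.587    * g + 0.114    * b);
        ycc[3 * i + 1] = clamp_u8(-0.168736 * r - 0.331264 * g + 0.5      * b + 128.0);
        ycc[3 * i + 2] = clamp_u8( 0.5      * r - 0.418688 * g - 0.081312 * b + 128.0);
    }
}
```

Because no iteration reads what another iteration writes, the same loop body maps directly onto one GPU thread per pixel; the sketch uses OpenMP only to keep it in plain C.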

  5. Implementation – Cont’d • Step 2 – Parallelize with CUDA • First, implement in OpenMP to confirm that the code is understood correctly • E.g., in encode_one_block, emit_bits_s changes the state of the system => running it in parallel with multiple threads leads to incorrect results! • Loop in question: for (k = 1; k <= Se; k++) { … if (! emit_bits_s(…)) return FALSE; … if (! emit_bits_s(…)) return FALSE; … if (! emit_bits_s(…)) return FALSE; … } • Second, make a baseline GPGPU implementation of all critical functions • Third, optimize the GPGPU implementations • Using constant memory
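The state dependency mentioned on this slide can be sketched with a simplified bit writer. This is not the library's emit_bits_s: the bit_writer struct, the emit_bits function, and the fixed 64-byte buffer are hypothetical, and JPEG's 0xFF byte stuffing is omitted. The point is only that every call reads and updates the pending-bit state left by the previous call.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for an entropy encoder's output state. */
typedef struct {
    uint8_t  out[64];   /* output bytes */
    size_t   n_out;     /* number of bytes written so far */
    uint32_t bit_buf;   /* pending bits, right-aligned */
    int      n_bits;    /* count of pending bits (0..7 between calls) */
} bit_writer;

/* Append the low `size` bits of `code` (1 <= size <= 16) to the stream.
 * Returns 0 if the output buffer is full, 1 otherwise. */
int emit_bits(bit_writer *w, uint32_t code, int size)
{
    /* The new bits are placed after whatever the previous call left behind. */
    w->bit_buf = (w->bit_buf << size) | (code & ((1u << size) - 1u));
    w->n_bits += size;

    /* Flush every complete byte, most significant bits first. */
    while (w->n_bits >= 8) {
        if (w->n_out >= sizeof w->out) return 0;
        w->out[w->n_out++] = (uint8_t)(w->bit_buf >> (w->n_bits - 8));
        w->n_bits -= 8;
    }

    /* Keep only the bits that are still pending. */
    w->bit_buf &= (1u << w->n_bits) - 1u;
    return 1;
}
```

Emitting the 3-bit code 101 and then the 5-bit code 11111 produces the byte 0xBF, while the reverse order produces 0xFD. Threads that call such a function concurrently, or in a different order, therefore corrupt the stream unless the bit offsets are computed ahead of time.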

  6. Evaluation • Evaluation Environment • CPU: Intel Nehalem Xeon E5520, 2.26 GHz • GPU: Tesla K20c • Picture used • My favorite picture • Compressing: 1280 x 768 pixels • Decompressing: the output of the compression step • Correctness checked with "diff"

  7. Evaluation – Cont’d • Timings are in milliseconds, averaged over 10 executions • Four threads are forked for the OpenMP implementation • For both GPU implementations, configurations are tuned for best performance • Results discussion • OpenMP is fastest. GPGPU basically degrades performance, and the "optimized" version degrades it further (due to serialized constant memory accesses). • Observations after hacking the code: • Each kernel launch deals with at most 250 elements, which is too fine-grained. • Kernel launch is expensive (allocating & copying the data) • Using OpenMP is always going to be better off as long as there is enough parallelism & loop iterations are not extremely trivial.
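The granularity observation can be made concrete with a back-of-the-envelope model. The function launch_overhead_fraction and all of its inputs are hypothetical placeholders, not values measured in this work; the sketch only shows how a fixed per-launch cost dominates when each launch handles few elements.

```c
/* Fraction of one kernel invocation's time spent on fixed launch overhead
 * (allocation, copies, the launch itself), under a simple linear model.
 * All three inputs are hypothetical placeholders, not measured values. */
double launch_overhead_fraction(long elements_per_launch,
                                double overhead_us,
                                double per_element_us)
{
    double work_us = (double)elements_per_launch * per_element_us;
    return overhead_us / (overhead_us + work_us);
}
```

Under this model, the overhead fraction shrinks only when elements_per_launch grows, which is the argument for coarser-grained parallelism, for example batching many blocks into a single launch.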

  8. Conclusion • For the JPEG encoding/decoding core system, GPGPU basically degrades performance. • Coarser-grained parallelism is required. • OpenMP acceleration can be easily applied to gain some performance.

  9. Thank you. Ang Li <ali28@wisc.edu>
