Explore the significance of the Discrete Cosine Transform (DCT) in image processing, compression (JPEG, MP3), scientific analysis, audio processing, and high-performance computing. Learn about its mathematical basis, algorithm walk-through, CUDA implementation, testing on GPU platforms, and future work. References included.
The Discrete Cosine Transform (DCT) • Christopher Mitchell • CDA 6938, Spring 2009
The Discrete Cosine Transform • In the same family as the Fourier Transform. • Converts data to the frequency domain. • Represents data as a sum of cosine waves of varying frequency. • Being a discrete transform, it is well suited to problems formatted for computer analysis. • Captures only the real (even) components of the function. • The Discrete Sine Transform (DST) captures the odd (imaginary) components → not as useful. • The Discrete Fourier Transform (DFT) captures both even and odd components → more computationally intensive.
Significance / Where is this used? • Image Processing • Compression - Ex.) JPEG • Scientific Analysis - Ex.) Radio Telescope Data • Audio Processing • Compression - Ex.) MPEG – Layer 3, a.k.a. MP3 • Scientific Computing / High Performance Computing (HPC) • Partial Differential Equation Solvers
Significance, Cont. • Image Processing Example • The DCT exhibits energy compaction: most of the signal energy concentrates in a few low-frequency coefficients. • Compression drops the small-amplitude coefficients. • [Figures: original image and its DCT transform]
Implementation Platform NVIDIA CUDA Version 2.0
Implementation Platform, Cont. • What happened to the Cell/BE? • Too many technical challenges given the deadline. • The algorithm is embarrassingly parallel • Conducive to launching hundreds of threads → GPU • The algorithm requires too much data per pass relative to the local store size. • Would have required creative DMA scheduling, with no guarantee of mitigating the bottleneck.
Algorithm Walk Through • Mathematical Basis • 1D Version (DCT-II): X(u) = α(u) Σ_{x=0}^{N−1} f(x) cos[ (2x+1)uπ / 2N ] • Where: α(0) = √(1/N) and α(u) = √(2/N) for u = 1, …, N−1 • 2D Version: X(u,v) = α(u) α(v) Σ_{x=0}^{N−1} Σ_{y=0}^{N−1} f(x,y) cos[ (2x+1)uπ / 2N ] cos[ (2y+1)vπ / 2N ] • Where α(u) and α(v) are defined as shown in the 1D case.
Algorithm Walk Through • CPU Version – 1D DCT
Algorithm Walk Through • CPU Version – 2D DCT
Algorithm Walk Through • Problem • 1D DCT is O(n²) • 2D DCT is O(n³) • Additionally, the algorithm makes repeated calls to cosine and square-root routines. • Long-latency ALU operations
Algorithm Walk Through • CUDA Version – 1D DCT
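The CUDA kernel on this slide was likewise an image. A minimal sketch of the parallelization strategy the slides describe, assuming one thread per output coefficient (kernel and variable names are illustrative):

```cuda
// One thread per output coefficient: each thread computes the full
// cosine sum for its index u, so the n-point transform takes O(n)
// time steps instead of O(n^2).
__global__ void dct1d_kernel(const float *in, float *out, int n)
{
    const float pi = 3.14159265358979f;
    int u = blockIdx.x * blockDim.x + threadIdx.x;
    if (u >= n) return;

    float alpha = (u == 0) ? sqrtf(1.0f / n) : sqrtf(2.0f / n);
    float sum = 0.0f;
    for (int x = 0; x < n; x++)
        sum += in[x] * cosf(pi * (2 * x + 1) * u / (2.0f * n));
    out[u] = alpha * sum;
}

// Host-side launch, assuming d_in and d_out are device buffers:
// dct1d_kernel<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
```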
Algorithm Walk Through • CUDA Version – 2D DCT
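As with the 1D kernel, the 2D CUDA code was lost with the slide image. A sketch assuming one thread per (u, v) output coefficient evaluating the direct 2D formula (names are illustrative, not the author's code):

```cuda
// One thread per (u, v) coefficient. With n*n threads, the double
// sum per thread gives O(n^2) parallel time, versus O(n^3) for the
// separable CPU version.
__global__ void dct2d_kernel(const float *in, float *out, int n)
{
    const float pi = 3.14159265358979f;
    int u = blockIdx.x * blockDim.x + threadIdx.x;
    int v = blockIdx.y * blockDim.y + threadIdx.y;
    if (u >= n || v >= n) return;

    float au = (u == 0) ? sqrtf(1.0f / n) : sqrtf(2.0f / n);
    float av = (v == 0) ? sqrtf(1.0f / n) : sqrtf(2.0f / n);
    float sum = 0.0f;
    for (int x = 0; x < n; x++)
        for (int y = 0; y < n; y++)
            sum += in[x * n + y]
                 * cosf(pi * (2 * x + 1) * u / (2.0f * n))
                 * cosf(pi * (2 * y + 1) * v / (2.0f * n));
    out[u * n + v] = au * av * sum;
}

// Host-side launch, assuming d_in and d_out are device buffers:
// dim3 block(16, 16);
// dim3 grid((n + 15) / 16, (n + 15) / 16);
// dct2d_kernel<<<grid, block>>>(d_in, d_out, n);
```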
Algorithm Walk Through • Solution • 1D DCT now runs in O(n) parallel time (one thread per coefficient) • 2D DCT now runs in O(n²) parallel time • Parallelization is key to success with this algorithm
Testing • Platform • Intel Core 2 Duo E6700 @ 2.66 GHz • Gigabyte GA-P35-DQ6 motherboard • 2 GB RAM • 2× NVIDIA GeForce 8600 GTS Superclocked GPUs • 720 MHz core clock • 256 MB GDDR3 memory • 4 multiprocessors → 32 streaming processors • Windows XP Professional (32-bit) with SP3 and NVIDIA ForceWare 178.24 drivers
Future Work • Multi-GPU version • A dual-card setup is available for testing. • Need an efficient way to split the problem between the two cards without incurring a large I/O penalty. • Still interested in trying a Cell/BE version of the algorithm. • Need to improve at CBEA programming. • DMA and local store size are the limiting factors for this particular problem.
References • NVIDIA CUDA Programming Guide, Version 2.1 • http://developer.download.nvidia.com/compute/cuda/2_1/toolkit/docs/NVIDIA_CUDA_Programming_Guide_2.1.pdf • The Discrete Cosine Transform (DCT): Theory and Application • http://www.egr.msu.edu/waves/people/Ali_files/DCT_TR802.pdf • CDA 6938 Lecture Notes and Slides