
Team Programming Project


Presentation Transcript


  1. Team Programming Project Byunghyun (Byung) Jang Ph.D student Northeastern University Jul. 26 2009 CRA-W/CDC Careers in High Performance Systems (CHiPS) Mentoring Workshop July 25-27 2009 National Center for Supercomputing Applications (NCSA) at University of Illinois at Urbana-Champaign (UIUC)

  2. Some words about me
  • 4th year Ph.D student
  • Born and raised in South Korea
  • 34 years old (never too late to learn)
  • B.S. in mechanical engineering and M.S. in computer science
  • Full time engineer at Samsung Electronics for 3 years
  • GPGPU
  • Internship at AMD and fellowship from AMD
  • Happy

  3. Goals
  • Understand General Purpose Computing on GPU (a.k.a. GPGPU)
  • Experience CUDA GPU programming
  • Understand how massively multi-threaded parallel programming works
  • Think about solving a problem in a parallel fashion
  • Experience the tremendous computational power of the GPU
  • Experience the challenges in efficient parallel programming

  4. Outline
  • Application 1: Image Rotation
    • Introduction and Design (15 min)
    • Preparation (5 min): installing the skeleton code, compile test, image view test
    • Hands-on Programming (30 min): replace ??? with your own CUDA code
  • Application 2: Histogram
    • Introduction and Design (15 min)
    • Preparation (5 min): installing the skeleton code, compile test
    • Hands-on Programming (40 min): replace ??? with your own CUDA code
  • Conclusion

  5. Application 1: Image Rotation - Introduction -
  • Rotate an image by a given angle
  • A basic feature in image processing applications
  [Figure: original input image and the rotated output image]

  6. Application 1: Image Rotation - Introduction -
  • What the application does:
    Step 1. Compute a new location according to the rotation angle (trigonometric computation)
    Step 2. Read the pixel value at the original location
    Step 3. Write that pixel value to the new location computed in Step 1
  • Create the same number of threads as there are pixels
  • Each thread takes care of moving one pixel
  • Our goals are
    • To understand how to use the GPU for data parallelism
    • To know how to map threads to data
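  For reference, a per-pixel rotation kernel could look like the minimal sketch below. This is not the workshop skeleton's actual code; the kernel name, parameter names, and the choice to rotate about the image center are illustrative assumptions.

     // Hypothetical sketch, not the skeleton's actual code: one thread per pixel,
     // rotating about the image center by `angle` (in radians).
     __global__ void rotateKernel(const float *d_in, float *d_out,
                                  int width, int height, float angle)
     {
         int x = blockIdx.x * blockDim.x + threadIdx.x;   // this thread's pixel column
         int y = blockIdx.y * blockDim.y + threadIdx.y;   // this thread's pixel row
         if (x >= width || y >= height) return;

         // Step 1: rotate the coordinate about the image center
         float cx = width  / 2.0f, cy = height / 2.0f;
         float xr = x - cx,        yr = y - cy;
         int xNew = (int)(xr * cosf(angle) - yr * sinf(angle) + cx);
         int yNew = (int)(xr * sinf(angle) + yr * cosf(angle) + cy);

         // Steps 2 and 3: read the original pixel and write it to the new location
         if (xNew >= 0 && xNew < width && yNew >= 0 && yNew < height)
             d_out[yNew * width + xNew] = d_in[y * width + x];
     }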

  7. Application 1: Image Rotation - Design -
  [Figure: thread mapping for a 512 x 512 image. The image is tiled by 8 x 8 thread blocks, giving a 64 x 64 grid of blocks from Thread Block (0, 0) to Thread Block (63, 63); each thread handles one pixel.]
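  A host-side launch matching this mapping might look like the following sketch, assuming the rotateKernel signature sketched above and device pointers d_in and d_out allocated elsewhere:

     // Illustrative launch: 8 x 8 threads per block, 64 x 64 blocks
     // covering the 512 x 512 image.
     dim3 block(8, 8);
     dim3 grid(512 / block.x, 512 / block.y);   // 64 x 64 thread blocks
     rotateKernel<<<grid, block>>>(d_in, d_out, 512, 512, angle);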

  8. Application 1: Image Rotation - Preparation -
  1. Deploy the skeleton code in the proper directory
     [..@ac ~]$ cp /tmp/projects.tar ./
     [..@ac ~]$ cp /tmp/cuda.pdf ./
     [..@ac ~]$ tar -xf projects.tar
  2. Request a cluster node for interactive use for 2 hours
     [..@ac ~]$ qsub -I -l walltime=02:00:00
  3. Compile
     [..@ac ~]$ cd PROJECTS/projects/ImageRotation
     [..@ac ~]$ make clean
     [..@ac ~]$ make
     To use printf() for debugging, run “make emu=1” instead of “make”
  4. Execute
     [..@ac ~]$ ./ImageRotation
  5. Convert the image from “pgm” to “jpg” format
     [..@ac ~]$ convert data/lena_out.pgm data/lena_out.jpg
  6. Download “lena_out.jpg” to your laptop to view it
  Download for your future reference

  9. Application 1: Image Rotation - Hands-on Programming -
  • Replace ??? in the skeleton code with your own CUDA code
  • Refer to the hints and comments in the skeleton code
  • Talk to me if you have any questions or are done
  • Try to finish by 2:30 pm
  • Help others if you finish early

  10. Application 2: Histogram - Introduction -
  • Shows how often each pixel intensity value occurs in the image
  • A commonly used analysis tool in image processing and data mining applications
  [Figure: input image and its output histogram; x-axis: intensity from 0 (black) to 255 (white), y-axis: number of pixels]

  11. Application 2: Histogram - Introduction -
  • The serial implementation looks like:
       data[DATA_COUNT];                        // input data
       histogram[BIN_COUNT];                    // histogram data
       for (int i = 0; i < BIN_COUNT; i++)
           histogram[i] = 0;                    // initialization
       for (int i = 0; i < DATA_COUNT; i++)
           histogram[ data[i] ]++;              // update the corresponding bin
  • Access to data[] is sequential, but access to histogram[] is random, depending on the data value
  • Therefore, we will use fast shared memory to store a per-block sub-histogram (s_hist[]), because shared memory handles random memory access much more efficiently than global memory does
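  To make the idea concrete, a simplified per-block kernel could look like the sketch below. This is not the skeleton's code: it uses atomicAdd() on a single shared-memory sub-histogram per block, whereas the workshop design on the next two slides gives every thread its own sub-histogram to avoid collisions. The names blockHistogram, d_data, and d_result are assumptions; BIN_COUNT = 64 follows the slides.

     #define BIN_COUNT 64

     // Simplified sketch (not the skeleton's actual code): each block builds its
     // own sub-histogram in shared memory with atomicAdd(), then writes it to
     // global memory; a separate merge step (slide 13) combines the block results.
     // Assumes each data value is already a bin index in [0, BIN_COUNT).
     __global__ void blockHistogram(const unsigned char *d_data, int dataCount,
                                    unsigned int *d_result)
     {
         __shared__ unsigned int s_hist[BIN_COUNT];       // per-block sub-histogram

         // Cooperatively zero the shared bins.
         for (int bin = threadIdx.x; bin < BIN_COUNT; bin += blockDim.x)
             s_hist[bin] = 0;
         __syncthreads();

         // Grid-stride loop: each thread counts a strided slice of the input.
         for (int i = blockIdx.x * blockDim.x + threadIdx.x;
              i < dataCount;
              i += blockDim.x * gridDim.x)
             atomicAdd(&s_hist[d_data[i]], 1u);
         __syncthreads();

         // Write this block's sub-histogram out to be merged later.
         for (int bin = threadIdx.x; bin < BIN_COUNT; bin += blockDim.x)
             d_result[blockIdx.x * BIN_COUNT + bin] = s_hist[bin];
     }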

  12. Application 2: Histogram - Design -
  • The structure of the shared memory would look like the following
  • Notice that shared memory is per thread block and limited in size
  [Figure: data[DATA_COUNT] partitioned into chunks of 64 data elements, which are accumulated into per-block shared-memory sub-histograms s_hist[]]

  13. Application 2: Histogram - Design -
  • Merging the per-thread histograms into a per-block histogram
  [Figure: each block keeps THREAD_N = 192 per-thread sub-histograms of BIN_COUNT = 64 bins in shared memory s_hist[]; they are merged into one BIN_COUNT-bin histogram per block, written to d_result[] (one sub-histogram per thread block), and finally combined into the final histogram]
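  The last step, combining the per-block sub-histograms in d_result[] into the final histogram, could be a small second kernel along these lines (an illustrative sketch; the kernel and parameter names are assumptions, and BIN_COUNT is as defined above):

     // Illustrative merge step: one thread per bin sums that bin across all
     // per-block sub-histograms stored in d_result[].
     __global__ void mergeHistograms(const unsigned int *d_result, int blockCount,
                                     unsigned int *d_histogram)
     {
         int bin = blockIdx.x * blockDim.x + threadIdx.x;
         if (bin >= BIN_COUNT) return;

         unsigned int sum = 0;
         for (int b = 0; b < blockCount; b++)
             sum += d_result[b * BIN_COUNT + bin];
         d_histogram[bin] = sum;                 // final histogram value for this bin
     }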

  14. Application 2: Histogram - Preparation -
  1. Compile
     [..@ac ~]$ cd PROJECTS/projects/Histogram
     [..@ac ~]$ make clean
     [..@ac ~]$ make
     To use printf() for debugging, run “make emu=1” instead of “make”
  2. Execute
     [..@ac ~]$ ./Histogram
  3. Check the output message
     “*** TEST FAILED”: something is wrong
     “*** TEST PASSED”: you got it

  15. Application 2: Histogram - Hands-on Programming -
  • Replace ??? in the skeleton code with your own CUDA code
  • Refer to the hints and comments in the skeleton code
  • Talk to me if you have any questions or are done
  • Try to finish by 3:30 pm
  • Help others if you finish early

  16. Conclusions
  • What we’ve learned throughout the two projects
    • Understood massively parallel computing on the GPU
    • Experienced what CUDA programming looks like
    • Understood how to explicitly program hardware resources
    • Understood the importance and challenges of parallel programming
    • Experienced solving a problem in a massively parallel fashion
  • The GPU is the platform of choice for data-parallel, computationally intensive applications
  • In a few years, we are likely to see many people buying a new graphics card to increase their desktop’s computing performance, not its 3D game performance

  17. Thank you!
