Team Members: Tyler Drake, Robert Wrisley, Kyle Von Koepping, Justin Walsh. Faculty Advisors: Computer Science – Prof. Sanjay Rajopadhye; Electrical & Computer Engineering – Prof. Olivera Notaros.
Project Goals: To develop parallel versions of applications that run on a graphics card and to measure their performance. • Started with a simple Matrix Multiply program. • We intend to develop at least one or two additional applications and to pursue an analysis of hardware optimizations. • Develop a process for tuning applications & hardware that other developers can easily reuse.
Tyler Drake – Computer Science major • Robert Wrisley – Computer Science/Computer Engineering dual major • Kyle Von Koepping – Electrical Engineering major • Justin Walsh – Computer Science/Computer Engineering dual major • Shared coding responsibilities • Enables comparison and greater understanding for all team members • Possibly divide responsibilities for the second half of the project
Moore’s Law • Transistor densities on single-core processors have doubled approximately every 18–24 months. • This trend has held since Gordon Moore first observed it in 1965 and is expected to continue for several more years. • This natural trend became the standard performance goal for hardware companies.
Limits of Moore’s Law • There is an ultimate limit to Moore’s Law: transistors will soon approach atomic scale. • Moore’s Law does not apply to Random Access Memory (RAM) speeds or hard-drive seek times (the so-called Memory Wall). • The redesign of processor architectures isn’t driven directly by Moore’s Law, but by the fact that these and other factors have not kept up with its growth rate.
The Graphics Card • The CPU (or CPUs) is not the only processor found in a personal computer. • The graphics card carries a graphics processing unit (GPU). • The GPU is specifically designed to render 3D models onto a 2D display. • It is built for floating-point computation with a highly parallel architecture.
CUDA • Engineers have begun to exploit the highly parallel architecture of the GPU for general applications. • Graphics companies encourage general-purpose computing on the GPU (GPGPU). • Nvidia has developed CUDA (Compute Unified Device Architecture). • Because CUDA is based on the C language, programmers can shift to developing on the GPU with relative ease.
What Have We Been Doing? • Learning about CUDA • NVIDIA CUDA guides • Lecture slides from University of Illinois, Urbana-Champaign • Papers from various academic groups • University of Illinois, Urbana-Champaign • Tokyo Institute of Technology • University of California at Berkeley • Learning to write parallel programs in CS475 using MPI & OpenMP • Writing simple programs using CUDA and observing performance • Matrix Multiply
Results and Optimizations • Results • Achieved 131 GFLOPS on a GTX 280 with N = 1024; the GTX 280’s theoretical peak is 933 GFLOPS. • Optimizations • Tiling the result matrix into smaller sub-matrices and having each thread block compute one sub-matrix reduces the amount of data each thread block must load. • This helps hide memory latency.
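The tiling scheme described above can be sketched as a CUDA kernel. This is a generic illustration, not the team's actual code: the TILE size, the kernel name, and the assumption that N is a multiple of TILE (true for N = 1024) are ours.

```cuda
#define TILE 16

// Each thread block computes one TILE x TILE sub-matrix of C.
// Each tile of A and B is loaded into shared memory once per block
// instead of once per thread, cutting global-memory traffic by
// roughly a factor of TILE.
__global__ void matMulTiled(const float *A, const float *B, float *C, int N) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < N / TILE; ++t) {
        // Cooperative load: each thread fetches one element of each tile.
        As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();  // wait until the whole tile is in shared memory

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();  // don't overwrite tiles still being read
    }
    C[row * N + col] = acc;  // assumes N is a multiple of TILE
}
```

The two `__syncthreads()` barriers are what make the shared tiles safe to reuse across loop iterations.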
Significant Lessons Learned and Other Useful Notes • Memory • Memory on the graphics card must be allocated from the main program running on the CPU • Graphics-card memory is explicitly managed by the programmer • CUDA is an “extension” to C, not a separate language • Similar in spirit to MPI, OpenMP, etc.
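The allocate/copy/launch/copy-back pattern described above can be sketched with the standard CUDA runtime calls (cudaMalloc, cudaMemcpy, cudaFree). The kernel and variable names here are hypothetical; this is an illustration of the pattern, not the project's code.

```cuda
#include <cstdio>
#include <cstdlib>

// Hypothetical kernel, just so the launch below has a target.
__global__ void scale(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main() {
    const int n = 1024;
    size_t bytes = n * sizeof(float);

    float *h = (float *)malloc(bytes);          // host (CPU) memory
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    float *d;
    cudaMalloc(&d, bytes);                      // allocate on the graphics card
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);  // CPU -> GPU

    scale<<<(n + 255) / 256, 256>>>(d, n);      // launch from the host

    cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);  // GPU -> CPU
    printf("h[0] = %f\n", h[0]);

    cudaFree(d);                                // programmer frees GPU memory
    free(h);
    return 0;
}
```

Nothing here is automatic: every allocation, transfer, and free on the device side is the programmer's responsibility.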
Where is our project headed? Increasing problem complexity • Some are no longer “Pleasantly Parallel” • Higher degree of kernel analysis • Moving to more dynamic programs
Additional programs being written for the GPU include: • Scan: prefix-sum computation where the ith element of the output is the sum of the i-1 elements before it • Knapsack: profit maximization given a capacity and a list of items with their weights & profits • Matrix Multiply for still larger matrices • Triangular Matrix Multiplication
Potential Applications Mandelbrot Set • Pleasantly parallel, familiar • Easily scalable
Potential Applications Ray Tracing • Very computationally intensive • Feasible for non-realtime computations • Very dynamic, due to recursion • High degree of realism
Potential Applications Examples of images generated by Ray Tracing
Potential Applications Hidden Markov Models • Clear parallelism • Wide range of applications
Potential Applications Uses of Hidden Markov Models
To develop a more complex application for the GPU and optimize its performance • To analyze hardware optimizations and evaluate the performance gains • To develop a process for future programmers that yields the best performance increases with the minimum development effort • Please Note: These goals are tentative and subject to change.
Moore’s Law is now being applied to cores per processor instead of transistors per core. • Multi-core machines offer the next generation of performance enhancements, and they are already here! • GPUs provide massively parallel architectures that programmers can exploit to see phenomenal performance gains.
• Learning to use the CUDA library and some of its nuances. • Have achieved good performance on our Matrix Multiply attempts. • Completing CUDA versions of the Scan and Knapsack problems. • Next, move on to a more complex application. • Researching hardware optimizations that can further enhance performance on GPUs. • Develop a combined approach for future application programmers to follow.
$50 spent on a CUDA-compatible graphics card. • We’d like to thank Prof. Dan Connors for the use of his machines with Nvidia GTX 280 graphics cards. • This gave us free access to a consistent platform on which to run our code and the sample code. • We don’t project any major costs next semester, except perhaps some materials for our E-Days presentation.