
FMM Algorithm Analysis for an Innovative Computer Architecture Design





Presentation Transcript


  1. FMM Algorithm Analysis for an Innovative Computer Architecture Design • Lv Chao (吕超) • Shanghai Jiao Tong University • School of Software • 2010

  2. Outline • Project background • Prior work • Introduction to the N-body problem • FMM algorithm analysis • Configuration strategies for FMM optimization • Conclusion

  3. Project Background • Project source: research and development of a new-concept, high-efficiency computer architecture and system • National 863 Program key project (2009AA012201) • Shanghai Science and Technology Commission major research project (08dz501600) • Topic: application analysis and front-end design for the new architecture • Preliminary application analysis • Front-end design of the architecture • Compiler / software platform design • Application optimization • Main goal: design a reconfigurable special-purpose processor architecture for high-performance computing

  4. Prior Work • Analysis of high-performance computing applications • Image reconstruction for CT and MRI • Local image feature extraction and matching based on the SURF algorithm • Application simulation and optimization • Optimization and analysis of SURF parallelized on multi-core CPUs • SURF implementation on GPGPU (CUDA-SURF) • SURF optimization on a heterogeneous CPU/GPU platform

  5. Introduction to the N-body Problem • Why introduce it • Analyzed as a representative application for the architecture • Used to derive architecture design strategies tailored to application optimization • The N-body problem • Also called the many-body problem; a fundamental problem in astrophysics, fluid dynamics, and molecular dynamics • Simulates the motion of mutually interacting particles in a system • A classic high-performance computing application • Mathematically: a system of ordinary differential equations with known initial values

  6. Introduction to the N-body Problem (cont.) • Common algorithms • PP (Particle-Particle) • Direct evaluation of the pairwise force formula • Time complexity O(N²) • PM (Particle-Mesh) • Uses a particle mesh, treating the effect of many points as a whole (computing the potential on the grid) • Time complexity O(N log N) • TM (Tree Method) • Groups distant particles hierarchically in a tree and approximates their aggregate effect (as in Barnes-Hut) • Time complexity O(N log N)
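The O(N²) cost of the PP approach above comes directly from visiting every particle pair. A minimal 2D sketch in plain Python (an illustration, not code from the deck; unit masses, G = 1, and a softening term `eps` are assumptions):

```python
# Direct Particle-Particle (PP) N-body force evaluation: O(N^2) pairwise sums.
# 2D positions, unit masses, G = 1; the softening term eps avoids division
# by zero when two particles coincide.

def pp_forces(pos, eps=1e-9):
    n = len(pos)
    forces = [(0.0, 0.0)] * n
    for i in range(n):
        fx = fy = 0.0
        for j in range(n):            # the inner loop over all j gives O(N^2)
            if i == j:
                continue
            dx = pos[j][0] - pos[i][0]
            dy = pos[j][1] - pos[i][1]
            r2 = dx * dx + dy * dy + eps
            inv_r3 = r2 ** -1.5       # F = r_vec / |r|^3 with G = m = 1
            fx += dx * inv_r3
            fy += dy * inv_r3
        forces[i] = (fx, fy)
    return forces
```

Every iteration of the inner loop is independent of the others, which is exactly why this kernel maps so naturally onto a data-parallel GPU.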

  7. Overview of GPU Architecture • Graphics Pipeline / Programmable Hardware / Unified Shading Model / NVIDIA GeForce 8800 GTX

  8. Graphics Pipeline • The Vertex/Geometry Stage • transforms each vertex from object space into screen space • assembles the vertices into triangles • traditionally performs lighting calculations on each vertex • The Rasterization Stage • determines the screen positions covered by each triangle • interpolates per-vertex parameters across the triangle • The Fragment/Pixel Stage • computes the color for each fragment • The Composition/Display Stage • assembles fragments into an image of pixels

  9. Programmable Hardware • In Programmable Graphics Pipeline • User-defined vertex program • User-defined fragment program • Limitations • Simple, incomplete instruction sets. • Fragment program data types are mostly fixed-point. • Limited number of instructions and a small number of registers. • Limited number of inputs and outputs • No conditional branching

  10. Unified Shader Model • A unified shader model must • Have at least 65k static instructions and unlimited dynamic instructions • Support both 32-bit integers and 32-bit floating-point numbers • Allow an arbitrary number of both direct and indirect reads from global memory (texture) • Support dynamic flow control in the form of loops and branches • Current GPUs support the unified Shader Model 4.0 on both vertex and fragment shaders

  11. Unified Shading Architecture NVIDIA GeForce 8800 GTX Architecture • Green grid – streaming multiprocessors • Purple blocks – thread processors • 16 streaming multiprocessors of 8 thread processors each (128 in total).

  12. Unified Shading Architecture (Con.) NVIDIA GeForce 8800 GTX – Streaming Multiprocessor • Each processor cluster contains a pair of streaming multiprocessors • Each streaming multiprocessor contains shared instruction and data caches, control logic, a 16 KB shared memory, eight stream processors, and two special function units.

  13. How To Program GPGPU • GPU Programming Model / GPU Programming Flow Control / GPGPU Techniques / GPGPU Applications

  14. GPU Programming Model • The GPU programming model combines • graphics API terminology • the stream programming model • A typical GPGPU program using the fragment processor is structured as follows • Segment the general-purpose program into independent parallel sections (kernels) • Specify the range of computation / the size of the output stream to invoke a kernel • Use the rasterizer to generate a fragment for every pixel location in the quad • Each generated fragment is then processed by the active kernel fragment program • The output of the fragment program is a value (or vector of values) per fragment
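The stream model above boils down to: a kernel is a pure function applied independently to every element of the input streams. A CPU-side Python sketch of that idea (hypothetical illustration; the `run_kernel` name and the saxpy kernel are not from the slides):

```python
# Stream-programming sketch: a "kernel" is a pure per-element function,
# applied independently to every element of the input streams -- the
# CPU-side analogue of running a fragment program over every pixel
# of the output quad.

def run_kernel(kernel, *streams):
    # All input streams must have the same length (the "range of computation").
    assert len({len(s) for s in streams}) == 1
    return [kernel(*elems) for elems in zip(*streams)]

# Example kernel: per-element saxpy (out = a*x + y), a classic GPGPU test.
a = 2.0
out = run_kernel(lambda x, y: a * x + y, [1.0, 2.0, 3.0], [10.0, 20.0, 30.0])
# out is [12.0, 24.0, 36.0]
```

Because the kernel sees only one element at a time, every invocation can run on a different fragment processor with no communication.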

  15. GPU Programming Flow Control • Three basic implementations of data-parallel branching • Predication • Both sides of the branch are evaluated • Multiple Instruction Multiple Data (MIMD) branching • Different processors follow different paths • Single Instruction Multiple Data (SIMD) branching • If the condition is identical for all pixels in the group, only the taken side of the branch must be evaluated • If one or more of the processors evaluates the branch condition differently, both sides must be evaluated and the results predicated • Better to move branching up the pipeline
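Predication, as described above, computes both branch sides for every element and then selects per element, so no real control-flow divergence ever occurs. A minimal Python sketch (the helper names are hypothetical):

```python
# Predication sketch: both branch sides are always computed for every
# element; a per-element predicate then selects which result survives.
# This trades wasted work for the absence of control-flow divergence.

def predicated_select(pred, then_fn, else_fn, xs):
    then_vals = [then_fn(x) for x in xs]   # both sides always evaluated...
    else_vals = [else_fn(x) for x in xs]
    flags = [pred(x) for x in xs]
    # ...then a per-element select keeps one result and discards the other.
    return [t if f else e for f, t, e in zip(flags, then_vals, else_vals)]

# abs(x) written with predication instead of a branch:
result = predicated_select(lambda x: x >= 0, lambda x: x, lambda x: -x,
                           [-3, 1, -2])
# result is [3, 1, 2]
```

This is why the slide recommends moving branching up the pipeline: the earlier a uniform decision is made, the less work is duplicated per fragment.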

  16. GPGPU Techniques • Stream Operations: • Map and Reduce – Straightforward [BFH∗04b] BUCK I., FOLEY T., HORN D., SUGERMAN J., FATAHALIAN K., HOUSTON M., HANRAHAN P.: Brook for GPUs: Stream computing on graphics hardware. ACM Transactions on Graphics 23, 3 (Aug. 2004), 777–786. • Scatter and Gather – Avoid scatter [Buc05b] BUCK I.: Taking the plunge into GPU computing. In GPU Gems 2, Pharr M., (Ed.). Addison Wesley, Mar. 2005, ch. 32, pp. 509–519. • Scan – All-prefix-sums operation [HS86] [Ble90] [Hor05] [HSC*05] [SLO06, GGK06] • Filtering – Using a combination of scan and search, O(log n) is achieved. [Hor05] HORN D.: Stream reduction operations for GPGPU applications. In GPU Gems 2, Pharr M., (Ed.). Addison Wesley, Mar. 2005, ch. 36, pp. 573–589. • Sort – Based on sorting networks, such as parallel bitonic merge sort [BP04, CND03, GZ06, KSW04, KW05a, PDC∗03, Pur04] • Search – Binary search [Hor05, PDC∗03, Pur04] / nearest-neighbor search [Ben75, FS05, PDC∗03, Pur04]
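The scan (all-prefix-sums) primitive listed above is worth one concrete sketch. A Hillis-Steele inclusive scan in Python (an illustration of the technique, not the cited implementations): it runs in log₂(n) passes, and each pass could be issued as a single data-parallel kernel over the whole array.

```python
# Hillis-Steele inclusive scan (all-prefix-sums): log2(n) passes, each of
# which reads the whole previous array and writes a new one -- exactly the
# shape of one GPU kernel launch per pass.

def inclusive_scan(xs):
    out = list(xs)
    step = 1
    while step < len(out):
        # One "parallel" pass: every element adds its neighbour `step` away.
        out = [out[i] + (out[i - step] if i >= step else 0)
               for i in range(len(out))]
        step *= 2
    return out
```

Note this variant is step-efficient but not work-efficient (O(n log n) additions); the Blelloch scan cited as [Ble90] reduces the work to O(n).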

  17. GPGPU Techniques (Con.) • Data Structures • Iteration • Dense structures are supported straightforwardly • Sparse arrays, adaptive arrays, and grid-of-lists structures require more complex iteration constructs [BFGS03, KW03, LKHW04] • Generalized Arrays via Address Translation • An address translator converts between a 1D array and a 2D texture [LKO05, PBMH02] • Optimization techniques pre-compute these address-translation operations before they reach the fragment processor [BFGS03, CHL04, KW03, LKHW04] • Differential Equations, Linear Algebra, Data Queries …
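The address translation mentioned above is simple arithmetic: a linear index is folded into (x, y) texel coordinates of a width × height texture, and unfolded on the way back. A minimal sketch (function names are hypothetical):

```python
# Address translation between a 1D array index and 2D texture coordinates,
# the trick used to store general arrays in a width x height texture.

def to_2d(i, width):
    return (i % width, i // width)   # (x, y) texel coordinate

def to_1d(x, y, width):
    return y * width + x             # inverse mapping

# Round trip over an 8x4 "texture":
width, height = 8, 4
assert all(to_1d(*to_2d(i, width), width) == i
           for i in range(width * height))
```

Pre-computing these mappings on the CPU (or in the vertex stage) is exactly the optimization the cited works apply, since it spares the fragment processor the per-fragment modulo and divide.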

  18. GPGPU Applications • Physically Based Simulation • Signal and Image Processing • Computer Vision • Image Processing • Signal Processing • Tone Mapping • Audio • Image / Video Processing • Global Illumination • Ray tracing, photon mapping, radiosity, subsurface scattering … • Geometric Computing • Databases and Data Mining

  19. Conclusion • Highly parallel nature • But currently only data-parallel for general-purpose computation • Many applications can be mapped onto the GPU • But no support for double precision, scatter, or efficient branching • Programmed through graphics APIs • But hard to understand and use • What we look forward to • More programmable and flexible hardware • A high-level programming model

  20. Any questions?

  21. The End Thank you
