
Product Availability Update

Parallel Processing on GPUs in the Fermi Architecture (Processamento Paralelo em GPUs na Arquitetura Fermi). Arnaldo Tavares, Tesla Sales Manager for Latin America.

Presentation Transcript


  1. Product Availability Update: Parallel Processing on GPUs in the Fermi Architecture (Processamento Paralelo em GPUs na Arquitetura Fermi). Arnaldo Tavares, Tesla Sales Manager for Latin America

  2. Quadro or Tesla? Tesla™ vs. Quadro™

  3. GPU Computing: CPU + GPU co-processing. CPU: 4 cores, 48 GigaFlops (DP). GPU: 448 cores, 515 GigaFlops (DP); average efficiency in Linpack: 50%.

  4. 50x-150x Speedups Across Domains • 146x Medical Imaging (U of Utah) • 36x Molecular Dynamics (U of Illinois, Urbana) • 18x Video Transcoding (Elemental Tech) • 50x Matlab Computing (AccelerEyes) • 100x Astrophysics (RIKEN) • 149x Financial Simulation (Oxford) • 47x Linear Algebra (Universidad Jaime) • 20x 3D Ultrasound (Techniscan) • 130x Quantum Chemistry (U of Illinois, Urbana) • 30x Gene Sequencing (U of Maryland)

  5. Increasing Number of Professional CUDA Apps (available now and announced/future) • Tools: CUDA C/C++, PGI Accelerators, Platform LSF Cluster Mgr, TauCUDA Perf Tools, Parallel Nsight Vis Studio IDE, TotalView Debugger, MATLAB, PGI CUDA x86 Tools, PGI CUDA Fortran, CAPS HMPP, Bright Cluster Manager, Allinea DDT Debugger, ParaTools VampirTrace, AccelerEyes Jacket MATLAB, Wolfram Mathematica • CUDA Libraries: NVIDIA NPP Perf Primitives, EM Photonics CULAPACK, CUDA FFT, CUDA BLAS, Thrust C++ Template Lib, MAGMA (LAPACK), NVIDIA Video Libraries, RNG & SPARSE • Oil & Gas: Headwave Suite, OpenGeoSolutions OpenSEIS, GeoStar Seismic Suite, Acceleware RTM Solver, StoneRidge RTM, Paradigm RTM, Panorama Tech, ffA SVI Pro, VSG Open Inventor, Seismic City RTM, Tsunami RTM, Paradigm SKUA • Bio-Chemistry: AMBER, NAMD, HOOMD, TeraChem, BigDFT, ABINIT, Acellera ACEMD, DL-POLY, GROMACS, LAMMPS, VMD, GAMESS, CP2K, OpenEye ROCS, PIPER Docking • Bio-Informatics: MUMmerGPU, CUDA-BLASTP, CUDA-MEME, HEX Protein Docking, CUDA-EC, CUDA SW++ (Smith-Waterman), GPU-HMMER • CAE: ACUSIM AcuSolve 1.8, Autodesk Moldflow, Prometech Particleworks, Remcom XFdtd 7.0, LSTC LS-DYNA 971, FluiDyna OpenFOAM, ANSYS Mechanical, Metacomp CFD++, MSC.Software Marc 2010.2

  6. Increasing Number of Professional CUDA Apps (available now and announced/future) • Video: Adobe Premiere Pro CS5, ARRI Various Apps, GenArts Sapphire, TDVision TDVCodec, Black Magic Da Vinci, The Foundry Kronos, MainConcept CUDA Encoder, Fraunhofer JPEG2000, Cinnafilm Pixel Strings, Assimilate SCRATCH, Elemental Video • Rendering: Bunkspeed Shot (iray), Refractive SW Octane, Random Control Arion, ILM Plume, Autodesk 3ds Max, Cebas finalRender, Works Zebra Zeany, mental images iray (OEM), NVIDIA OptiX (SDK), Caustic Graphics, Weta Digital PantaRay, Lightworks Artisan, Chaos Group V-Ray GPU • Finance: NAG RNG, Numerix Risk, SciComp SciFinance, RMS Risk Mgt Solutions, Murex MACS, Aqumin AlphaVision, Hanweck Options Analytics • EDA: Agilent EMPro 2010, CST Microwave, Agilent ADS SPICE, Acceleware FDTD Solver, Rocketick Verilog Sim, Synopsys TCAD, SPEAG SEMCAD X, Gauda OPC, Acceleware EM Solution • Medical: Siemens 4D Ultrasound, Digisens, Schrodinger Core Hopping, Useful Progress • Other: MVTec Machine Vision, MotionDSP Ikena Video, Manifold GIS, Dalsa Machine Vision, Digital Anarchy Photo

  7. 3 of Top 5 Supercomputers

  8. 3 of Top 5 Supercomputers

  9. What if Every Supercomputer Had Fermi? (Linpack TeraFlops, based on the Top 500 Supercomputers list of Nov 2009) • Top 50: 450 GPUs, 110 TeraFlops, $2.2M • Top 100: 225 GPUs, 55 TeraFlops, $1.1M • Top 150: 150 GPUs, 37 TeraFlops, $740K

  10. Hybrid ExaScale Trajectory * This is a projection based on Moore’s law and does not represent a committed roadmap

  11. Tesla Roadmap

  12. The March of the GPUs (chart: double-precision performance over time, NVIDIA GPU with ECC off vs. x86 CPU)

  13. Project Denver

  14. Expected Tesla Roadmap with Project Denver

  15. Workstation / Data Center Solutions • Integrated CPU-GPU Server: 2x Tesla M2050/70 GPUs in 1U • OEM CPU Server + Tesla S2050/70: 4 Tesla GPUs in 2U • Workstations: up to 4x Tesla C2050/70 GPUs

  16. Tesla C-Series Workstation GPUs

  17. How is the GPU Used? • Basic component: the Streaming Multiprocessor (SM) • SIMD: "Single Instruction, Multiple Data": the same instruction is issued to all cores, but each core operates on different data • "SIMD at the SM, MIMD at the GPU chip" Source: Presentation from Felipe A. Cruz, Nagasaki University
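
A minimal CUDA sketch of this SIMD model (the kernel name and launch sizes are illustrative, not from the slides): every thread executes the same instruction stream, each on its own array element.

    // Each thread runs the same instructions; only the data index differs.
    __global__ void scale(float *data, float factor, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // unique thread index
        if (i < n)
            data[i] *= factor;  // same instruction, different data
    }

    // Example launch: 4 blocks of 256 threads cover n = 1024 elements.
    // scale<<<4, 256>>>(d_data, 2.0f, 1024);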

  18. The Use of GPUs and Bottleneck Analysis Source: Presentation from Takayuki Aoki, Tokyo Institute of Technology

  19. The Fermi Architecture • 3 billion transistors • 16 Streaming Multiprocessors (SMs) • 6 x 64-bit memory partitions = 384-bit memory interface • Host Interface: connects the GPU to the CPU via PCI-Express • GigaThread global scheduler: distributes thread blocks to SM thread schedulers
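
If useful, these properties can be confirmed at run time; a minimal sketch using the CUDA runtime's documented cudaGetDeviceProperties call (device 0; fields are from the cudaDeviceProp struct):

    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);  // query device 0
        printf("Name: %s\n", prop.name);
        printf("SMs: %d\n", prop.multiProcessorCount);      // 16 on a full Fermi
        printf("Global memory: %zu bytes\n", prop.totalGlobalMem);
        printf("Compute capability: %d.%d\n", prop.major, prop.minor);  // 2.0 for Fermi
        return 0;
    }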

  20. SM Architecture • 32 CUDA cores per SM (512 total) • 16 load/store units: source and destination addresses calculated for 16 threads per clock • 4 special function units (sine, cosine, square root, etc.) • 64 KB of RAM for shared memory and L1 cache (configurable) • Dual warp scheduler (diagram: instruction cache, dual scheduler/dispatch, register file, 32 cores, 16 load/store units, 4 special function units, interconnect network, 64 KB configurable cache/shared memory, uniform cache)

  21. Dual Warp Scheduler • 1 warp = 32 parallel threads • 2 warps are issued and executed concurrently • Each warp executes on a group of 16 CUDA cores • Most instructions can be dual-issued (exception: double-precision instructions) • The dual-issue model allows near-peak hardware performance
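
Warps are formed from consecutive threads of a block; a small illustrative sketch (not from the slides) of how a kernel can recover its warp and lane numbers:

    __global__ void warpInfo(int *warpOf, int *laneOf)
    {
        int tid  = threadIdx.x;
        int warp = tid / 32;   // which warp of the block this thread belongs to
        int lane = tid % 32;   // position within the 32-thread warp
        warpOf[tid] = warp;
        laneOf[tid] = lane;
    }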

  22. CUDA Core Architecture • New IEEE 754-2008 floating-point standard, surpassing even the most advanced CPUs • Newly designed integer ALU optimized for 64-bit and extended-precision operations • Fused multiply-add (FMA) instruction for both 32-bit single and 64-bit double precision (diagram: each CUDA core contains a dispatch port, operand collector, FP unit, INT unit, and result queue, within the SM layout of 32 cores, 16 load/store units, 4 special function units, interconnect network, 64 KB configurable cache/shared memory, uniform cache)

  23. Fused Multiply-Add Instruction (FMA)
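
This slide is a figure in the transcript; the point of FMA is that a*b + c is computed with a single rounding step, instead of rounding after the multiply and again after the add. A minimal device-code sketch using the standard CUDA math function:

    __global__ void fma_demo(float *out, float a, float b, float c)
    {
        // fmaf fuses the multiply and add with one rounding step: more
        // accurate than a separate a * b (rounded) followed by + c (rounded).
        out[0] = fmaf(a, b, c);   // single precision
        // fma(double, double, double) is the double-precision counterpart.
    }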

  24. GigaThread™ Hardware Thread Scheduler (HTS) • Hierarchically manages thousands of simultaneously active threads • 10x faster application context switching (each program receives a time slice of processing resources) • Concurrent kernel execution

  25. GigaThread Hardware Thread Scheduler: Concurrent Kernel Execution + Faster Context Switch (diagram: the same five kernels scheduled one after another under serial kernel execution vs. packed side by side under parallel kernel execution, shortening total time)
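
A minimal sketch of concurrent kernel execution using CUDA streams (the kernel bodies are placeholders of my own; the stream mechanics are the point):

    #include <cuda_runtime.h>

    __global__ void kernelA(float *x) { x[threadIdx.x] += 1.0f; }
    __global__ void kernelB(float *y) { y[threadIdx.x] *= 2.0f; }

    int main()
    {
        float *x, *y;
        cudaMalloc((void**)&x, 64 * sizeof(float));
        cudaMalloc((void**)&y, 64 * sizeof(float));

        cudaStream_t s1, s2;
        cudaStreamCreate(&s1);
        cudaStreamCreate(&s2);

        // Independent kernels launched into different streams may
        // overlap on Fermi-class hardware.
        kernelA<<<1, 64, 0, s1>>>(x);
        kernelB<<<1, 64, 0, s2>>>(y);

        cudaDeviceSynchronize();   // wait for both streams to finish
        cudaStreamDestroy(s1);
        cudaStreamDestroy(s2);
        cudaFree(x);
        cudaFree(y);
        return 0;
    }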

  26. GigaThread Streaming Data Transfer (SDT) Engine • Dual DMA engines • Simultaneous CPU→GPU and GPU→CPU data transfers • Fully overlapped with CPU and GPU processing time (activity snapshot: kernels 0-3 each overlap CPU work, SDT engine 0, GPU compute, and SDT engine 1)
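
A minimal sketch of asynchronous copy-compute-copy in one stream (the kernel and sizes are illustrative); overlap across streams relies on the dual DMA engines and requires page-locked host memory:

    #include <cuda_runtime.h>

    __global__ void process(const float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = 2.0f * in[i];
    }

    int main()
    {
        const int n = 1 << 20;
        const size_t bytes = n * sizeof(float);

        float *h_in, *h_out, *d_in, *d_out;
        cudaMallocHost((void**)&h_in, bytes);    // pinned host memory:
        cudaMallocHost((void**)&h_out, bytes);   // required for async copies
        cudaMalloc((void**)&d_in, bytes);
        cudaMalloc((void**)&d_out, bytes);

        cudaStream_t s;
        cudaStreamCreate(&s);

        // Copy in, compute, copy out: asynchronous with respect to the CPU,
        // and overlappable with other streams via the dual DMA engines.
        cudaMemcpyAsync(d_in, h_in, bytes, cudaMemcpyHostToDevice, s);
        process<<<(n + 255) / 256, 256, 0, s>>>(d_in, d_out, n);
        cudaMemcpyAsync(h_out, d_out, bytes, cudaMemcpyDeviceToHost, s);
        cudaStreamSynchronize(s);

        cudaStreamDestroy(s);
        cudaFreeHost(h_in); cudaFreeHost(h_out);
        cudaFree(d_in); cudaFree(d_out);
        return 0;
    }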

  27. Cached Memory Hierarchy • First GPU architecture to support a true cache hierarchy in combination with on-chip shared memory • Shared/L1 cache per SM (64 KB): improves bandwidth and reduces latency • Unified L2 cache (768 KB): fast, coherent data sharing across all cores in the GPU • Global memory (up to 6 GB)
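
The shared/L1 split is selectable per kernel; a minimal sketch using the runtime's cudaFuncSetCacheConfig (the kernel is a placeholder of my own):

    #include <cuda_runtime.h>

    __global__ void myKernel(float *data) { data[threadIdx.x] += 1.0f; }

    void configure()
    {
        // Prefer 48 KB shared memory + 16 KB L1 for kernels that tile data:
        cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferShared);
        // ...or prefer 48 KB L1 + 16 KB shared for irregular access patterns:
        cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferL1);
    }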

  28. CUDA: Compute Unified Device Architecture • NVIDIA's parallel computing architecture • A software development platform targeting the GPU architecture

  29. Thread Hierarchy • Kernels (simple C programs) are executed by threads • Threads are grouped into Blocks • Threads in a Block can synchronize execution • Blocks are grouped into a Grid • Blocks are independent (they must be executable in any order) Source: Presentation from Felipe A. Cruz, Nagasaki University
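
A minimal sketch of the hierarchy in source form (names are illustrative): the kernel body is what one thread does; the launch configuration groups threads into blocks and blocks into a grid.

    __global__ void kernelBody(int *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = i;
        // __syncthreads() would act as a barrier for this block only;
        // there is no barrier across blocks within a launch.
    }

    // A grid of 4 independent blocks, each with 256 threads:
    // kernelBody<<<4, 256>>>(d_out, 1024);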

  30. Memory and Hardware Hierarchy • Threads access Registers • CUDA Cores execute Threads • Threads within a Block can share data/results via Shared Memory • Streaming Multiprocessors (SMs) execute Blocks • Grids use Global Memory for result sharing (after kernel-wide global synchronization) • The GPU executes Grids Source: Presentation from Felipe A. Cruz, Nagasaki University
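
A minimal sketch mapping that hierarchy to source code (illustrative names): local variables live in registers, __shared__ arrays in per-SM shared memory, and kernel pointer arguments in global memory.

    // Launch with 256 threads per block.
    __global__ void memorySpaces(float *global_data)  // global memory (device DRAM)
    {
        __shared__ float tile[256];           // shared memory, visible to the block
        float r = global_data[threadIdx.x];   // r is held in a register
        tile[threadIdx.x] = r;
        __syncthreads();                      // make the whole tile visible to all
        global_data[threadIdx.x] = tile[255 - threadIdx.x];  // share via the tile
    }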

  31. Full View of the Hierarchy Model

  32. IDs and Dimensions • Threads: 3D IDs, unique within a block • Blocks: 2D IDs, unique within a grid • Dimensions are set at launch time and can be unique for each grid • Built-in variables: threadIdx, blockIdx, blockDim, gridDim (diagram: Device > Grid 1 as a 3x2 array of Blocks; Block (1,1) expanded into a 5x3 array of Threads)
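
A sketch of the built-in variables for a 2D launch like the one in the diagram (a 3x2 grid of 5x3-thread blocks; the kernel is illustrative):

    __global__ void ids2d(int *out, int width)
    {
        int col = blockIdx.x * blockDim.x + threadIdx.x;  // global x index
        int row = blockIdx.y * blockDim.y + threadIdx.y;  // global y index
        out[row * width + col] = row * width + col;
    }

    // dim3 grid(3, 2);    // 3 x 2 blocks, as in the diagram
    // dim3 block(5, 3);   // 5 x 3 threads per block
    // ids2d<<<grid, block>>>(d_out, 15);  // width = 3 blocks * 5 threads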

  33. Compiling C for CUDA Applications

    void serial_function(…) { ... }
    void other_function(int ...) { ... }

    void saxpy_serial(float ...) {
        for (int i = 0; i < n; ++i)
            y[i] = a*x[i] + y[i];
    }

    void main() {
        float x;
        saxpy_serial(..);
        ...
    }

Build flow: the source splits into key kernels (C for CUDA) and the rest of the C application. NVCC (Open64) compiles the kernels, which are modified into parallel CUDA code, producing CUDA object files; the CPU compiler produces CPU object files; the linker combines both into a single CPU-GPU executable.

  34. C for CUDA: C with a few keywords (side-by-side panels: standard C code vs. parallel C code; a reconstruction sketch follows)
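
The slide's two code panels are images in this transcript; below is a sketch of the standard serial-versus-CUDA saxpy pair such slides show, consistent with the serial loop on slide 33:

    // Standard C: one thread, explicit loop.
    void saxpy_serial(int n, float a, float *x, float *y)
    {
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }

    // C for CUDA: the loop becomes n threads; __global__ marks the kernel.
    __global__ void saxpy_parallel(int n, float a, float *x, float *y)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            y[i] = a * x[i] + y[i];
    }

    // Invocation with 256-thread blocks, rounding the block count up:
    // saxpy_parallel<<<(n + 255) / 256, 256>>>(n, 2.0f, d_x, d_y);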

  35.-42. Software Programming (a sequence of eight image-based slides) Source: Presentation from Andreas Klöckner, NYU

  43. CUDA C/C++ Leadership (toolkit timeline) • CUDA Toolkit 1.0: C compiler, C extensions, single precision, BLAS, FFT, SDK with 40 examples, Win XP 64 • CUDA Toolkit 1.1: atomics support, multi-GPU support • CUDA Toolkit 2.0: double precision, compiler optimizations, Vista 32/64, Mac OS X, 3D textures, HW interpolation • CUDA Visual Profiler 2.2, cuda-gdb HW debugger • CUDA Toolkit 2.3: DP FFT, 16-32 conversion intrinsics, performance enhancements • Parallel Nsight Beta • CUDA Toolkit 3.0: C++ inheritance, Fermi architecture support, tools updates, driver/runtime interop

  44. Why should I choose Tesla over consumer cards?
