Shader Performance Analysis on a Modern GPU Architecture

Victor Moya, Carlos González, Jordi Roca, Agustín Fernández (Department of Computer Architecture, UPC). Roger Espasa (Intel DEG Barcelona).

Presentation Transcript


  1. Shader Performance Analysis on a Modern GPU Architecture • Victor Moya, Carlos González, Jordi Roca, Agustín Fernández (Department of Computer Architecture, UPC) • Roger Espasa (Intel DEG Barcelona)

  2. Introduction • Shaders in GPUs evolving towards general programming • Branches, generic loads, scatter • New types of shaders: geometry in DX10 • Current specialized shaders • Area hungry • Unbalancing leads to inefficiencies • This paper: unify all shaders • ~8% higher performance with less area & resources

  3. Outline • Attila – our GPU architecture • Attila-Classic: Non-unified shaders • Attila-Unified: Unified Shaders • Simulation Framework • Results

  4. Outline • Attila – our GPU architecture • Attila-Classic: Non-unified shaders • Attila-Unified: Unified Shaders • Simulation Framework • Results

  5. ATTILA • Our implementation of current GPUs • Inspired by both NVIDIA and ATI • Not an exact match for either pipeline • Lack of detailed microarchitecture information • Educated guesses on our part • Implemented Features • 2D Homogeneous Recursive Rasterization • Tiled Rasterization • Hierarchical Z • Texture compression • Anisotropic filtering • Depth compression, fast z/stencil and color clear

  6. Outline • Attila – our GPU architecture • Attila-Classic: Non-unified shaders • Attila-Unified: Unified Shaders • Simulation Framework • Results

  7. Attila Classic • Pipeline diagram: Vertex Fetch, 4 Vertex Shaders, Primitive Assembly, Clipping, Triangle Setup, Rasterization, HierarchicalZ, 4 Fragment Shaders, 4 ROPs, 4 Memory Controllers • Diagram label: Specialized Shaders

  8. Specialized Shader Issues • Unbalancing • In fragment-shading-limited scenarios (the typical case) up to 30% of the processing power remains idle (for a GPU with 8 vertex and 4 fragment shaders) • In vertex-shading-limited scenarios up to 70% of the processing power remains idle • Dedicated Area • 4 unused vertex shaders have the same processing power as 1 fragment shader • 4 vertex shaders require 66% of the area of a fragment shader • Different Designs • Increases the complexity of the microarchitecture • Increases development and verification time
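
The idle percentages on this slide can be roughly reproduced from its own ratio of 4 vertex shaders to 1 fragment shader. The following is a minimal sketch, not the authors' method: it assumes that in a fragment-limited scene every vertex shader is idle and in a vertex-limited scene every fragment shader is idle. The function and constant names are illustrative.

```python
# Throughput expressed in "fragment-shader equivalents".
# Assumption taken from the slide: 4 vertex shaders ~ 1 fragment shader.
VERTEX_SHADER_POWER = 1.0 / 4.0
FRAGMENT_SHADER_POWER = 1.0


def idle_fractions(num_vertex, num_fragment):
    """Return (idle share when fragment-limited, idle share when vertex-limited).

    Simplifying assumption: the shader class that is not the bottleneck
    is counted as completely idle.
    """
    vertex_power = num_vertex * VERTEX_SHADER_POWER
    fragment_power = num_fragment * FRAGMENT_SHADER_POWER
    total = vertex_power + fragment_power
    return vertex_power / total, fragment_power / total


# Configuration quoted on the slide: 8 vertex and 4 fragment shaders.
fragment_limited_idle, vertex_limited_idle = idle_fractions(8, 4)
print(f"fragment-limited: {fragment_limited_idle:.0%} idle")  # 33%
print(f"vertex-limited:   {vertex_limited_idle:.0%} idle")    # 67%
```

Under these assumptions the results (about 33% and 67%) land close to the slide's "up to 30%" and "up to 70%".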

  9. Outline • Attila – our GPU architecture • Attila-Classic: Non-unified shaders • Attila-Unified: Unified Shaders • Simulation Framework • Results

  10. Attila Unified Shader • Pipeline diagram: Vertex Fetch, Shader Scheduler / Distributor, Primitive Assembly, Clipping, Triangle Setup, Rasterization, HierarchicalZ, Unified Shader Pool (several Shader units), 4 ROPs, 4 Memory Controllers

  11. Unified Shader Architecture • Benefits • Unified programming model • DX10/SM4 and OpenGL/GLSlang are already pushing for it • The same features for all the program targets • Texturing, branching, outputs • Not just vertex and fragment programs • DX10 => geometry shader • General Purpose GPU or Stream Processor • Workload balance • Shading resources allocated as required at any point of the rendering

  12. Unified Shader Architecture • Costs • Scheduler • Select which kind of workload must be processed next • Partly implemented with multithreading in the fragment shader to hide texture access latency • Larger instruction memory and constant bank • Rerouting required • All the paths cross the shader pool
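
The scheduler cost listed above can be pictured with a small sketch. This is a hypothetical model, not the Attila scheduler: the class name, the two queues and the "fragments first" policy are all assumptions chosen only to show how one pool of units can serve either workload.

```python
from collections import deque


class UnifiedShaderScheduler:
    """Hypothetical scheduler handing queued work to free unified shader units."""

    def __init__(self, num_units):
        self.free_units = num_units
        self.vertex_queue = deque()
        self.fragment_queue = deque()

    def submit(self, kind, item):
        """Queue one vertex or fragment work item."""
        queue = self.vertex_queue if kind == "vertex" else self.fragment_queue
        queue.append(item)

    def dispatch(self):
        """Issue pending work while free units remain.

        Assumed policy: fragments are served first, vertices otherwise.
        """
        issued = []
        while self.free_units > 0:
            if self.fragment_queue:
                issued.append(("fragment", self.fragment_queue.popleft()))
            elif self.vertex_queue:
                issued.append(("vertex", self.vertex_queue.popleft()))
            else:
                break
            self.free_units -= 1
        return issued

    def retire(self, count=1):
        """Mark units as free again once their work completes."""
        self.free_units += count


# Vertex-limited moment: no fragments pending, so every unit takes vertex work.
scheduler = UnifiedShaderScheduler(num_units=4)
for vertex_id in range(6):
    scheduler.submit("vertex", vertex_id)
first_batch = scheduler.dispatch()
print(first_batch)
```

In a non-unified design the same vertex burst could only use the dedicated vertex shaders, which is the imbalance the unified pool is meant to remove.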

  13. Outline • Attila – our GPU architecture • Attila-Classic: Non-unified shaders • Attila-Unified: Unified Shaders • Simulation Framework • Results

  14. ATTILA Framework • OpenGL Interceptor tool • OpenGL library for Attila GPU • Driver for our Attila GPU • Attila GPU simulator • Signal Visualizer Tool

  15. Collect, Verify, Simulate, Analyze • Workflow diagram elements: OpenGL Application, GLInterceptor, Trace, GLPlayer, Statistics, Vendor OpenGL Driver, ATTILA OpenGL Driver, Signal Traffic, ATI R520/NVidia G70, ATTILA Simulator, Framebuffer, Signal Visualizer • Framebuffer outputs are compared at two CHECK! points

  16. Collect, Verify, Simulate, Analyze • GLInterceptor • Captures a trace of OpenGL API calls from a real game • (Workflow diagram repeated from slide 15)

  17. Collect, Verify, Simulate, Analyze • GLPlayer • Reproduces the captured trace • (Workflow diagram repeated from slide 15)
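
The capture and replay steps of slides 16 and 17 can be illustrated with a toy recorder. This is only a sketch of the idea: `TraceRecorder`, `replay` and `FakeGL` are hypothetical names, the JSON trace format is an assumption, and nothing here reflects how GLInterceptor or GLPlayer are actually built.

```python
import json


class TraceRecorder:
    """Wraps a backend object and logs every call made through it."""

    def __init__(self, backend):
        self._backend = backend
        self.trace = []

    def __getattr__(self, name):
        # Only reached for names not found on the recorder itself,
        # i.e. the API calls forwarded to the backend.
        target = getattr(self._backend, name)

        def wrapper(*args):
            self.trace.append({"call": name, "args": list(args)})
            return target(*args)

        return wrapper


def replay(trace, backend):
    """Re-issue every recorded call, in order, against another backend."""
    for entry in trace:
        getattr(backend, entry["call"])(*entry["args"])


class FakeGL:
    """Stand-in for a graphics API; it only remembers what it was asked to do."""

    def __init__(self):
        self.log = []

    def clear(self, mask):
        self.log.append(("clear", mask))

    def draw_arrays(self, mode, first, count):
        self.log.append(("draw_arrays", mode, first, count))


# Collect: the application runs against the recorder.
original = FakeGL()
app_api = TraceRecorder(original)
app_api.clear(0x4000)
app_api.draw_arrays(4, 0, 3)
stored_trace = json.dumps(app_api.trace)

# Verify / simulate: the stored trace is replayed on a different backend.
replayed = FakeGL()
replay(json.loads(stored_trace), replayed)
print(replayed.log == original.log)  # True
```

Comparing the two logs plays the role of the framebuffer checks in the workflow diagram: the replay should drive the second backend exactly as the application drove the first.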

  18. Collect, Verify, Simulate, Analyze • OpenGL Library • - Transforms Fixed Function into Shader code • - 200 API Calls supported • - ARB Vertex and Fragment extensions • - Alpha and Fog emulated via Shader code • Driver • - Low level access • - Attila memory management • (Workflow diagram repeated from slide 15)

  19. Collect, Verify, Simulate, Analyze • ATTILA Simulator • - Detailed cycle-by-cycle simulation of all pipeline stages • - 20 boxes, modeling a 100-deep pipeline • - Execute@Execute: functionality embedded at each pipeline stage • (Workflow diagram repeated from slide 15)
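
A cycle-by-cycle simulation built from boxes can be sketched as follows. This is an illustrative toy, not the ATTILA simulator: the `Signal`, `Producer` and `Consumer` classes and the fixed-latency wire are assumptions made to show the general pattern of clocking every box once per cycle.

```python
from collections import deque


class Signal:
    """Fixed-latency wire: a value written at cycle t is readable at t + latency."""

    def __init__(self, latency):
        self.latency = latency
        self._in_flight = deque()  # (arrival_cycle, value) in write order

    def write(self, cycle, value):
        self._in_flight.append((cycle + self.latency, value))

    def read(self, cycle):
        """Return one value that has arrived by this cycle, or None."""
        if self._in_flight and self._in_flight[0][0] <= cycle:
            return self._in_flight.popleft()[1]
        return None


class Producer:
    """Box that emits one item per cycle until it runs out."""

    def __init__(self, out_signal, count):
        self.out_signal = out_signal
        self.next_item = 0
        self.count = count

    def clock(self, cycle):
        if self.next_item < self.count:
            self.out_signal.write(cycle, self.next_item)
            self.next_item += 1


class Consumer:
    """Box that records the cycle at which each item arrives."""

    def __init__(self, in_signal):
        self.in_signal = in_signal
        self.received = []

    def clock(self, cycle):
        value = self.in_signal.read(cycle)
        if value is not None:
            self.received.append((cycle, value))


def simulate(boxes, cycles):
    """Clock every box once per simulated cycle."""
    for cycle in range(cycles):
        for box in boxes:
            box.clock(cycle)


wire = Signal(latency=3)
producer = Producer(wire, count=3)
consumer = Consumer(wire)
simulate([producer, consumer], cycles=8)
print(consumer.received)  # [(3, 0), (4, 1), (5, 2)]
```

Because the wire latency is at least one cycle, the order in which boxes are clocked within a cycle does not change the result, which keeps each box independent of the others.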

  20. Find the differences • Attila vs. NVIDIA GeForce FX 5900XT

  21. Outline • Attila – our GPU architecture • Attila-Classic: Non-unified shaders • Attila-Unified: Unified Shaders • Simulation Framework • Results

  22. Benchmark • Unreal Tournament 2004 • Fixed function OpenGL API • Vertex and fragment shaders generated by our library • 1024x768 resolution • 8x Anisotropic Filtering • 160 of 450 frames simulated • 40 frames ~ 1 day of simulation • On a Xeon P4 @ 2.0 GHz
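
The simulation cost implied by this slide is a one-line calculation. The sketch below assumes the quoted rate of about 40 frames per day holds uniformly across frames, which the slide does not state.

```python
FRAMES_PER_DAY = 40  # from the slide: "40 frames ~ 1 day"


def simulation_days(frames):
    """Estimated wall-clock days, assuming a constant simulation rate."""
    return frames / FRAMES_PER_DAY


simulated_subset_days = simulation_days(160)
full_trace_days = simulation_days(450)
print(simulated_subset_days)  # 4.0
print(full_trace_days)        # 11.25
```

At that rate the 160 simulated frames take about four days, and the full 450-frame trace would take a little over eleven.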

  23. Baseline Configuration • Four Vertex Shaders (only for Attila-Classic) • Fragment and Unified shader configuration: • 32 threads • 4 fragments/vertices per thread • 16 128-bit FP registers available for temporary storage per thread • n SIMD ALUs • 1 scalar ALU (optional) • 1 Texture Unit per Shader Unit • 16 KB texture cache • Single-cycle bilinear and two-cycle trilinear • AF up to 16x • Geometry and Rasterization pipelines limited to 1 vertex and 1 triangle per cycle • Two ROPs: 8 z and 8 color values written per cycle • Four 64-bit DDR buses: peak bandwidth 64 bytes/cycle
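
The peak bandwidth figure follows from the bus parameters on the slide. This sketch assumes that "cycle" means a memory clock cycle and that DDR delivers two transfers per clock; the constant names are illustrative.

```python
NUM_BUSES = 4            # from the slide
BUS_WIDTH_BITS = 64      # from the slide
TRANSFERS_PER_CYCLE = 2  # assumption: double data rate, two transfers per clock

bytes_per_transfer = BUS_WIDTH_BITS // 8
peak_bytes_per_cycle = NUM_BUSES * bytes_per_transfer * TRANSFERS_PER_CYCLE
print(peak_bytes_per_cycle)  # 64
```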

  24. “Classic” Performance • Chart: series for 2, 4, 6 and 8 shaders (2sh, 4sh, 6sh, 8sh); annotated values 7%, ~40%, ~45%, ~75%, 8% • 8% improvement for 2-way • Near linear improvement for 4 shaders • Sublinear improvement for 6 and 8 shaders • Limited by memory bandwidth and latency

  25. Frame 330 – Detailed Zoom • Vertex shading limited • Chart: vertex shader and fragment shader workload for 4 vertex shader units and 2 fragment shader units

  26. Unified Shader Performance • Chart: series for 2, 4, 6 and 8 shaders (2sh, 4sh, 6sh, 8sh) • Unified improvement ranges from 1% (2 shaders) to 8% (eight 1-way shaders) • Fragment shading limited • Vertex fetch limited • Geometry pipeline limited

  27. Area Estimation • 160 - 120 = 40 = 2 vertex shaders × 2.5 + 2 fragment shaders × 15 + 5 (other)
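
The area arithmetic on this slide can be checked directly. The per-unit costs below are the slide's own numbers; the slide does not state their unit, so they are treated as unitless here, and the function name is illustrative.

```python
VERTEX_SHADER_AREA = 2.5     # per vertex shader, from the slide
FRAGMENT_SHADER_AREA = 15.0  # per fragment shader, from the slide
OTHER_AREA = 5.0             # "other", from the slide


def added_area(num_vertex, num_fragment, other=OTHER_AREA):
    """Area added by extra shader units plus the slide's 'other' term."""
    return (num_vertex * VERTEX_SHADER_AREA
            + num_fragment * FRAGMENT_SHADER_AREA
            + other)


area_delta = added_area(2, 2)
print(area_delta)  # 40.0, matching 160 - 120

# Cross-check against slide 8: four vertex shaders versus one fragment shader.
four_vertex_vs_fragment = 4 * VERTEX_SHADER_AREA / FRAGMENT_SHADER_AREA
print(round(four_vertex_vs_fragment, 2))  # 0.67
```

The second value agrees with the earlier statement that 4 vertex shaders need about 66% of the area of a fragment shader.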

  28. Shader Scaling vs Transistors • Chart: series for 2, 4, 6 and 8 shaders (2sh, 4sh, 6sh, 8sh) • Linear for 4 shader units, sublinear for more than 4 shader units • Up to 30% more efficient per area for the unified architecture (two 1-way shaders)

  29. Conclusion • Attila Unified architecture has better performance than Attila Classic with less hardware • Up to 8% better performance • 8% to 25% less area required • 10% to 30% better performance per area • Up to 8% better performance for 2-way shader units • 160% better performance from 2 to 8 fragment or unified shader units • Memory bandwidth limited beyond 4 shaders
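
The performance-per-area figure combines a performance gain with an area saving. The sketch below shows the general relation only; pairing a 1% gain with an 8% area saving is an assumption made for illustration, since the slides do not say which performance and area figures belong to the same configuration.

```python
def perf_per_area_gain(perf_gain, area_saving):
    """Relative improvement in performance per area.

    perf_gain:   fractional performance increase (0.08 means 8% faster)
    area_saving: fractional area reduction (0.08 means 8% less area)
    """
    return (1.0 + perf_gain) / (1.0 - area_saving) - 1.0


# Illustrative pairing (assumption): 1% faster with 8% less area.
example_gain = perf_per_area_gain(0.01, 0.08)
print(f"{example_gain:.1%}")  # 9.8%
```

With that pairing the result is close to the lower end of the 10% to 30% range quoted in the conclusion.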

  30. Questions

  31. Performance of Attila Unified vs Classic Attila
