
Presentation Transcript


  1. Data-Parallel Digital Signal Processors: Algorithm mapping, Architecture scaling, and Workload adaptation Sridhar Rajagopal

  2. Digital Signal Processors (DSPs) • Audio, automobile, broadband, military, networking, security, video and imaging, wireless communications • A $5 billion (and growing) market today

  3. We always want something faster! New high-performance applications drive the need for faster DSPs • Physical-layer signal processing in high-speed wireless communications to support multimedia • Application-layer signal processing for video and imaging

  4. Example: wireless systems (32-user system)
• 2G (1996): 16 Kbps/user, single-user. Estimation: correlator; detection: matched filter; decoding: Viterbi. Theoretical minimum ALUs @ 1 GHz: > 2
• 3G (2003): 128 Kbps/user, multi-user. Estimation: maximum likelihood; detection: interference cancellation; decoding: Viterbi. Theoretical minimum ALUs @ 1 GHz: > 20
• 4G (?): 1 Mbps/user, MIMO. Estimation: chip equalizer; detection: matched filter; decoding: LDPC. Theoretical minimum ALUs @ 1 GHz: > 200

  5. Data-Parallel DSPs: state-of-the-art [Figure: internal memory feeding clusters of ALUs (adders and multipliers)] • Clusters of ALUs provide billions of computations per second • Exploit data parallelism in signal processing applications • Imagine stream processor – Stanford (1998 - 2004)

  6. Proposal: Research questions for DP-DSPs • Will DP-DSPs work well for wireless systems? • How do I design DP-DSPs to meet real-time constraints at the lowest power? • Can I improve power efficiency further by adapting DSPs to the application?

  7. Contributions: Algorithm mapping • Efficient mapping of (wireless) algorithms • parallelization, structure, memory access patterns • tradeoffs between ALU utilization, inter-cluster communication, memory stalls, packing • A reduced inter-cluster network is proposed • exploits inter-cluster communication patterns • allows greater scalability of the architecture by reducing wires

  8. Contributions: Architecture scaling • Design methodology and tool to explore architectures for low power • Provides candidate architectures for low power • Provides insights into ALU utilization and performance • Compile-time exploration is orders-of-magnitude faster than run-time exploration

  9. Contributions: Workload adaptation • Adapt the number of clusters and ALUs to changes in workload during run-time • Multiplexer network designed • adapts clusters to DP at run-time • turns off unused clusters using power gating • Significant power savings at run-time (up to 60%)

  10. Thesis contributions [Figure: data-parallel DSP with clusters of adders and multipliers] • Algorithm mapping: design of algorithms for efficient mapping and performance • Architecture scaling: having designed the algorithms, find a low power processor • Workload adaptation: having designed the processor, improve power at run-time

  11. Outline • DP-DSPs : Parallelism and architecture • Power-aware design exploration • Power-aware resource utilization at run-time • Conclusions

  12. Parallelism levels in DP-DSPs • Instruction Level Parallelism (ILP) - DSP • Subword Parallelism (SubP) - DSP • Data Parallelism (DP) – vector processor • These levels are not independent: DP can decrease by increasing ILP and SubP (e.g., loop unrolling)

  13. Code snippet for ILP, SubP, DP
#define N 64
int i, a[N], b[N], sum[N];
short int c[N], d[N], diff[N];
for (i = 0; i < N; ++i) {      /* DP: iterations are independent of each other */
  sum[i] = a[i] + b[i];        /* ILP: independent of the next statement */
  diff[i] = c[i] - d[i];       /* SubP: 16-bit subword arithmetic */
}

  14. Data-Parallel DSPs [Figure: microcontroller, internal memory, and clusters of adders and multipliers; ILP and SubP within a cluster, DP across clusters] • ILP and SubP are exploited within a cluster, DP across clusters • Communication between clusters uses the inter-cluster communication network • The microcontroller issues the same instruction to all clusters
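
To make this execution model concrete, here is a toy sketch in C (not the Imagine ISA or tool chain; cluster count, lane size, and operations are illustrative): a single broadcast routine stands in for the microcontroller issuing one instruction, and every cluster applies it to its own slice of the data.

#include <stdio.h>

#define CLUSTERS 4   /* illustrative cluster count */
#define LANE 8       /* elements handled per cluster */

enum op { OP_ADD, OP_MUL };

/* The microcontroller "broadcasts" one operation; every cluster applies it
 * to its own slice of the data (DP across clusters). */
static void broadcast(enum op o, int dst[CLUSTERS][LANE],
                      int x[CLUSTERS][LANE], int y[CLUSTERS][LANE]) {
    for (int c = 0; c < CLUSTERS; ++c)        /* conceptually in parallel */
        for (int i = 0; i < LANE; ++i)
            dst[c][i] = (o == OP_ADD) ? x[c][i] + y[c][i]
                                      : x[c][i] * y[c][i];
}

int main(void) {
    int x[CLUSTERS][LANE], y[CLUSTERS][LANE], z[CLUSTERS][LANE];
    for (int c = 0; c < CLUSTERS; ++c)
        for (int i = 0; i < LANE; ++i) { x[c][i] = c; y[c][i] = i; }

    broadcast(OP_ADD, z, x, y);               /* same instruction, all clusters */
    printf("cluster 2, element 3: %d\n", z[2][3]);   /* prints 5 */
    return 0;
}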

  15. ILP is resource-bound [Figure: schedule for matrix-matrix multiplication as ALUs increase, showing adder, multiplier, and inter-cluster communication usage over time] • ILP is dependent on resources such as ALUs, read/write ports, inter-cluster communication, registers • Any one resource bottleneck can affect ILP

  16. Signal processing algorithms have plenty of DP Observations: • More DP is available after exploiting ILP and SubP to the point of diminishing returns • This is used to set the number of clusters • As clusters are added to exploit this ‘extra’ DP, ILP and SubP are not affected significantly • This ‘extra’ DP is defined as Cluster DP (CDP); a rough way to estimate it is sketched below
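
As a rough illustration of that definition (an assumed back-of-the-envelope estimate, not a formula from the thesis), the leftover DP can be approximated by dividing the total data parallelism by the amount absorbed into loop unrolling and subword packing:

#include <stdio.h>

/* CDP estimate: data parallelism left over after loop unrolling (ILP) and
 * subword packing (SubP) have each consumed their share. The numbers in
 * main() are illustrative only. */
static int cluster_dp(int data_parallel_iters, int unroll_factor,
                      int subwords_per_word) {
    return data_parallel_iters / (unroll_factor * subwords_per_word);
}

int main(void) {
    /* e.g. a 1024-iteration data-parallel loop, unrolled 4x,
     * operating on 2 packed 16-bit subwords per word */
    printf("CDP ~ %d clusters\n", cluster_dp(1024, 4, 2));  /* prints 128 */
    return 0;
}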

  17. Observing CDP in Viterbi decoding [Plot: frequency needed to attain real-time (MHz, log scale) versus number of clusters for constraint lengths K = 5, 7, 9, with a DSP reference; the maximum CDP is marked]

  18. Designing low power DP-DSPs [Figure: design space ranging from ‘1’ cluster at 100 GHz to ‘100’ clusters at 10 MHz, generalized to ‘c’ clusters of ‘a’ adders and ‘m’ multipliers at ‘f’ MHz] Find the right (a, m, c, f) to minimize power: a – #adders/cluster, m – #multipliers/cluster, c – #clusters, f – clock frequency

  19. Detailed simulation using the Imagine processor simulator • Cycle accurate, parameterized simulator • Insights into operations every cycle • High-level C++-based programming • GUI interface shows dependencies and schedule • Power and VLSI scaling model available • Open source allows modifications in architecture, tools

  20. Need for design exploration tool • Random choice may be way off • 100x power variation possible • Exhaustive simulation not possible • large parameter space (hours for each simulation) • DSP compilers need hand optimizations for performance • evolving algorithms -- architecture exploration needed

  21. Design exploration framework [Figure: two-phase flow around the data-parallel DSP] • Design phase: for the base (worst-case) workload, explore (a, m, c, f) and choose the combination that minimizes power, then implement the hardware • Dynamic adaptation phase: at run-time, use the application workload’s utilization to turn down (a, m, c, f) and save power

  22. DSPs are compute-bound with predictable performance [Figure: total execution time (cycles) split into computations (t_compute) and stalls (t_stall); microcontroller stalls and exposed memory stalls add to the total, while hidden memory stalls overlap with computation]
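
A minimal numeric sketch of that breakdown (the cycle counts are invented for illustration; only the decomposition itself comes from the slide):

#include <stdio.h>

/* Total cycles = computation cycles plus the stalls that are not hidden
 * behind computation (hidden memory stalls are already overlapped). */
int main(void) {
    long t_compute     = 900000;   /* computation cycles                 */
    long t_mem_exposed =  80000;   /* memory stall cycles not overlapped */
    long t_ucontroller =  20000;   /* microcontroller stall cycles       */
    long t_total = t_compute + t_mem_exposed + t_ucontroller;

    printf("total = %ld cycles, %.0f%% spent computing\n",
           t_total, 100.0 * t_compute / t_total);
    return 0;
}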

  23. Minimization for power • C(a,m,c) – capacitance from the simulator model • f(a,m,c) – real-time clock frequency, obtained by running the application on the (a,m,c) architecture

  24. Sensitivity to technology and modeling • Sensitivity to technology ‘p’ • Sensitivity to adder-multiplier power ratio ‘α’: 0.01 ≤ α ≤ 0.1 for 32-bit adders and 32x32 multipliers • Sensitivity to memory stalls ‘β’: difficult to predict at compile time (5-20%); assume q = 25% of execution time as the worst case; f_stall = q · (1 − β) · f_min, 0 ≤ β ≤ 1
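
A small numeric sketch of the stall term (f_min is an arbitrary example value; only the q = 25% worst case and the f_stall relation come from the slide):

#include <stdio.h>

int main(void) {
    const double f_min  = 400.0;               /* MHz, assumed compute-only frequency */
    const double q      = 0.25;                /* worst-case stall fraction           */
    const double beta[] = {0.0, 0.5, 1.0};     /* memory stall parameter              */

    for (int i = 0; i < 3; ++i) {
        double f_stall = q * (1.0 - beta[i]) * f_min;    /* f_stall = q(1-beta)f_min */
        printf("beta = %.1f : f = %.0f + %.0f = %.0f MHz\n",
               beta[i], f_min, f_stall, f_min + f_stall);
    }
    return 0;
}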

  25. Design exploration: big picture • Start with (a, m, c) = (∞, ∞, ∞) • Find (a, m, c) where ILP, SubP, and DP are fully exploited • Find c that minimizes P for (max(a), max(m)) • Find (a, m) that minimizes P using that c • Explore sensitivity to α, β, p
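
A sketch of that search order in C. The capacitance and real-time-frequency functions below are stand-in stubs with invented constants (the thesis obtains them from the simulator’s power model and from scheduling the application on each configuration), and the objective is taken as C(a,m,c) · f(a,m,c)^p, consistent with the power ∝ f^p curves on slide 28; none of this is the tool’s actual code.

#include <math.h>
#include <stdio.h>

#define A_MAX 5
#define M_MAX 3
#define C_MAX 512
#define P_EXP 3.0                 /* technology exponent p (assumed) */

/* Stub capacitance model C(a,m,c); alpha is the adder/multiplier power ratio. */
static double capacitance(int a, int m, int c) {
    const double alpha = 0.05;    /* assumed, within the 0.01-0.1 range */
    return c * (alpha * a + m);
}

/* Stub real-time frequency f(a,m,c) in MHz. The only behavior modeled is that
 * clusters beyond the available CDP stop reducing the required frequency;
 * the constants are invented. */
static double realtime_frequency(int a, int m, int c) {
    const double cdp = 64.0, work = 40000.0;
    double useful = (c < cdp) ? c : cdp;
    return work / (useful * (a + m));
}

static double power(int a, int m, int c) {
    return capacitance(a, m, c) * pow(realtime_frequency(a, m, c), P_EXP);
}

int main(void) {
    /* Step 1: fix ALUs at their maximum and sweep the cluster count. */
    int best_c = 1;
    for (int c = 2; c <= C_MAX; c *= 2)
        if (power(A_MAX, M_MAX, c) < power(A_MAX, M_MAX, best_c)) best_c = c;

    /* Step 2: with that cluster count, sweep (a, m). */
    int best_a = 1, best_m = 1;
    for (int a = 1; a <= A_MAX; ++a)
        for (int m = 1; m <= M_MAX; ++m)
            if (power(a, m, best_c) < power(best_a, best_m, best_c)) {
                best_a = a; best_m = m;
            }

    printf("candidate under the stub models: (a,m,c) = (%d,%d,%d)\n",
           best_a, best_m, best_c);
    return 0;
}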

  26. Running algorithms at (a_max, m_max, c_CDP)

  27. Real-time frequency with clusters for (a, m) = (5, 3) [Plot: real-time frequency (MHz, log scale) versus number of clusters for β = 0, 0.5, 1, with 538 MHz and 541 MHz marked]

  28. Choosing clusters: c = 64 at 541 MHz [Plot: normalized power versus number of clusters for power ∝ f^2, f^2.5, and f^3]

  29. ALU utilization (+, *) for c = 64, α = 0.01, β = 1, p = 3

  30. Choosing ALUs (a,m) for c = 64

  31. Insights from analysis • Sensitivity importance: p, α, β • Design gives candidates for low power solutions: Design I: (a, m, c): (∞, ∞, ∞) → (5, 3, 512) → (5, 3, 64) → (2, 1, 64); Design II: (a, m, c): (∞, ∞, ∞) → (5, 3, 512) → (5, 3, 64) → (3, 1, 64) • Power minimization is related to ALU efficiency • the same as maximizing a scaled version of ALU utilization

  32. Advantages of design exploration tool • Simulator (S) • cycle-accurate (execution time at run-time) • explore 100 machine configurations in 100 hours (conservative) • modification of parameters and code for different runs • Tool (T) • cycle-approximate (execution time at compile time) • explore millions of configurations in 100 hours • automated process all the way • generate plots for defense the day before • Rapid evaluation of candidate algorithms for future systems

  33. Verification of design tool [Bar chart: real-time clock frequency (MHz) split into computations and stalls, comparing tool (T) and simulator (S) for Design I, Design II, and the human design] • Human choice: (3, 3, 32) @ 1.2 V, 0.13 µm, 1 GHz = 18.2 W • Exploration tool choice: (2, 1, 64) at 887 MHz; estimated base power @ 1.2 V, 0.13 µm = 13.2 W

  34. Cluster utilization [Plot: cluster utilization (%) versus cluster index number for 32-cluster and 64-cluster designs] • The 64-cluster design is inefficient in terms of cluster utilization (54% for clusters 33-64) • But it still has lower power than 32 clusters due to the difference in f • The difference reduces as p → 2

  35. Improving power efficiency • Clusters are a significant source of power consumption (50-75%) • When CDP < c, unutilized clusters waste power • Dynamically turn off clusters using power gating to improve power efficiency
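
A sketch of the run-time decision (the 50-75% cluster power share is from this slide; the CDP values and the linear power model are illustrative assumptions, not the thesis’s measurements):

#include <stdio.h>

#define TOTAL_CLUSTERS 64

/* Turn off the clusters that the current kernel's CDP cannot use. */
static int active_clusters(int cdp) {
    return (cdp < TOTAL_CLUSTERS) ? cdp : TOTAL_CLUSTERS;
}

int main(void) {
    const double cluster_share = 0.6;            /* assumed: clusters are 50-75% of power */
    const int cdp_per_kernel[] = {64, 32, 16};   /* illustrative run-time workloads       */

    for (int i = 0; i < 3; ++i) {
        int on = active_clusters(cdp_per_kernel[i]);
        /* gated-off clusters contribute neither static nor dynamic power here */
        double rel_power = (1.0 - cluster_share)
                         + cluster_share * (double)on / TOTAL_CLUSTERS;
        printf("CDP = %2d -> %2d clusters on, ~%.0f%% of full power\n",
               cdp_per_kernel[i], on, 100.0 * rel_power);
    }
    return 0;
}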

  36. Data access is difficult after adaptation [Figure: 4 → 2 cluster reconfiguration; with clusters turned off, how is data obtained from the other memory banks?] • Data is not in the correct memory banks • Overhead in bringing the data in: external memory, inter-cluster network

  37. Multiplexer network design [Figure: no reconfiguration, 4:2 reconfiguration, 4:1 reconfiguration, and all clusters off] • Unused clusters are turned off using power gating to eliminate static and dynamic power dissipation • The multiplexer network adapts the active clusters to the DP
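
A sketch of the data-access idea only (the thesis’s actual network is not reproduced; the lane-to-cluster mapping below is an assumption for illustration): in a 4:2 reconfiguration, each surviving cluster also services the data lane of one gated-off cluster, reading that lane’s memory bank through the mux network instead of going out to external memory.

#include <stdio.h>

#define MAX_CLUSTERS 4
#define BANK_WORDS 4

/* Illustrative lane-to-cluster mapping after reconfiguration. */
static int owner_cluster(int lane, int active) {
    return lane % active;
}

int main(void) {
    int bank[MAX_CLUSTERS][BANK_WORDS];
    for (int c = 0; c < MAX_CLUSTERS; ++c)        /* data laid out for 4 clusters */
        for (int w = 0; w < BANK_WORDS; ++w)
            bank[c][w] = 100 * c + w;

    int active = 2;                               /* 4:2 reconfiguration */
    for (int lane = 0; lane < MAX_CLUSTERS; ++lane) {
        int owner = owner_cluster(lane, active);
        /* active cluster 'owner' reads this lane's bank via the mux network */
        printf("lane %d -> cluster %d (first word %d)\n",
               lane, owner, bank[lane][0]);
    }
    return 0;
}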

  38. Run-time variations in workload [Plot: cluster utilization (%) versus cluster index number for workloads with K = 5, 7, 9]

  39. Benefits of the multiplexer network • Power efficiency at design time: human choice (3, 3, 32), base power @ 1.2 V, 0.13 µm, 1 GHz = 18.2 W; exploration tool choice (2, 1, 64), base power @ 1.2 V, 0.13 µm, 887 MHz = 13.2 W • Power efficiency at run-time with the mux network: 9.9 W (K = 9), 7.4 W (K = 7), 6.8 W (K = 5)

  40. Design exploration for 2G-3G-4G systems [Plot: real-time clock frequency (MHz, log scale) versus data rate for 2G, 3G, and 4G* workloads, with candidate architectures such as (1,1,32), (2,1,32), (2,1,64), and (3,1,64) marked] A “power”ful tool for algorithm-architecture exploration

  41. Broader impact • Power-aware design exploration with improved run-time power efficiency • Techniques can be applied to all high performance, power efficient DSP designs • Handsets, cameras, video

  42. Future extensions • Fabrication needed to verify concepts • Higher performance • Multi-threading (ILP, SubP, DP, MT) • Pipelining (ILP, SubP, DP, MT, PP) • LDPC decoding • Sparse matrix requires permutations over large data • Indexed SRF in stream processors [Jayasena, HPCA 2004]

  43. Conclusions • Providing high performance with 100s-1000s of ALUs while keeping power low is a challenge for DSP designers • Algorithm design for efficient mapping on DP-DSPs • Design exploration tool for low power DP-DSPs • Provides candidate DSPs for low power • Allows algorithm-architecture evaluation for new systems • Power efficiency provided during both design and use of DP-DSPs

  44. Acknowledgements • Dr. Joseph R. Cavallaro, Dr. Scott Rixner • Imagine stream processor group at Stanford • Abhishek, Ujval, Brucek, Dr. Dally • Marjan, Predrag, Alex • 4G MIMO + LDPC • Thesis committee • Nokia, Texas Instruments, TATP, NSF
