
Data movement in Dark Silicon Systems

Pietro Cicotti, UCSD/SDSC, Performance Modeling and Characterization Lab (PMaC). Dagstuhl Seminar "Dark Silicon: From Embedded to HPC Systems", February 1st, 2016.



Presentation Transcript


  1. Data movement in Dark Silicon Systems Pietro Cicotti UCSD/SDSC, Performance Modeling and Characterization lab (PMaC) Dagstuhl Seminar, Dark Silicon: from Embedded to HPC Systems, February 1st 2016

  2. Locality changes with Dark Silicon • Power on phase • New resources become available • Memory coming online is empty • Processing resources may need to access or fetch data • Power off phase • Updates must be flushed • On/Off cycles imply data movement and changes in locality

  3. Example: Computational Sprinting • A number of dark cores are activated and participate in a computation • HPC: threads are spawned (or resumed) and occupy the resources, working collaboratively • Is data shared? Read-only? Streamed or reused? • Is the architecture homogeneous? Do dark cores have the same caches? • What determines the duration of the sprint? • Available thermal capacity is independent of computational phases • Locality changes with the resources utilized

  4. Example: Invasive Computing • More flexibility than sprinting • The application/runtime system is in charge • Request, use, and release resources • Different kinds of resources • Compute, store, communicate • Data moves in and out while invading and retreating • Requires understanding of data access patterns to devise the best invasion strategy • Interaction with the system is required to match goals with the resources available
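The request/use/release cycle above can be sketched as a minimal resource-manager protocol. This is a hypothetical illustration: the class and method names (`ResourceManager`, `invade`, `retreat`) are invented for this sketch and are not the actual invasive-computing API.

```python
# Hypothetical sketch of the request/use/release cycle; names are
# illustrative, not the real invasive-computing interface.
class ResourceManager:
    def __init__(self, total_cores):
        self.free_cores = total_cores

    def invade(self, requested):
        """Request cores; the system may grant fewer than requested."""
        granted = min(requested, self.free_cores)
        self.free_cores -= granted
        return granted

    def retreat(self, cores):
        """Release cores, making them available (dark) again."""
        self.free_cores += cores

def run_phase(mgr, want):
    got = mgr.invade(want)   # claim resources for this phase
    # ... run the computation on `got` cores, moving data in
    # while invading and flushing updates before retreating ...
    mgr.retreat(got)         # release resources
    return got

mgr = ResourceManager(total_cores=8)
print(run_phase(mgr, want=6))  # grants 6 of the 8 cores
```

The key point the sketch captures is that the grant may be smaller than the request, so the application must adapt its data layout to whatever it actually receives.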

  5. PMaC Work on Modeling Tools and Models • Develop tools and models • Understand an application's behavior • Binary instrumentation, HW counters • Understand a system's behavior • HW counters, simulation • Models • Application + system characterization to estimate performance/efficiency • Examples: • Model systems and emerging technologies, devise optimizations

  6. Configurable Memory Hierarchy (Exascale report, Peter Kogge, 2008 [4]) • If the memory hierarchy can be partially powered • Idea: increase efficiency by customization • Accelerators, reconfigurable hardware, heterogeneous systems, etc. • The focus of this work is on-chip memory • E.g. 20% of CPU power draw • Configure caches • Power cache levels on/off • Resize levels (power banks on/off) • How to select the optimal configuration?
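Selecting a configuration requires an energy estimate per candidate. A back-of-envelope version might combine per-access dynamic energy, leakage over the run, and off-chip energy for misses; all constants below are illustrative placeholders, not CACTI outputs (only the 40 pJ/bit DRAM figure comes from the slides, scaled here to an assumed 64-byte line).

```python
# Toy energy model for one cache configuration; constants are
# illustrative placeholders, not CACTI-derived values.
def config_energy_nj(accesses, misses, cycles,
                     e_access_nj=0.05,        # dynamic energy per access (assumed)
                     leak_nj_per_cycle=0.01,  # leakage energy per cycle (assumed)
                     e_dram_nj=20.48):        # 40 pJ/bit * 512-bit line [6]
    dynamic = accesses * e_access_nj      # switching energy
    leakage = cycles * leak_nj_per_cycle  # static leakage over the run
    dram = misses * e_dram_nj             # off-chip accesses on misses
    return dynamic + leakage + dram

def pick_best(configs):
    # configs: list of (name, accesses, misses, cycles) from simulation
    return min(configs, key=lambda c: config_energy_nj(*c[1:]))[0]
```

With such a model, a larger cache can win despite higher leakage when it removes enough DRAM accesses, which is exactly the trade-off the configuration search explores.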

  7. Workload Characterization • Locality characterization • Analyze address stream • Binary instrumentation • Cache simulations • Benchmarks • 37 benchmarks • HPC, DoD, DoE, bio, and data mining

  8. Hardware Configurations • Configuration space of 2652 points searched (dimensions: L1, L2, and L3 size) • Size: 8KB-64MB • 1-3 levels • Ln*2 ≤ Ln+1 • Parameters tuned for each size • Associativity, banking, etc. approximate current processors • Dynamic energy, leakage, and latency estimated with CACTI [5] • DRAM • 40pJ/bit, 400-cycle latency [6] • Reference: 32KB L1, 256KB L2, 2.5MB L3/core • Estimate performance and energy variations using binary instrumentation and cache simulations
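The size/level dimensions of the space above can be enumerated directly; this sketch assumes power-of-two sizes and covers only sizes and level counts, so it yields fewer points than the full 2652-point space, which also varies associativity, banking, and other tuned parameters.

```python
# Enumerate hierarchy shapes: power-of-two sizes from 8 KB to 64 MB,
# 1-3 levels, each level at least twice the previous (Ln*2 <= Ln+1).
KB = 1024
SIZES = [8 * KB * 2**i for i in range(14)]  # 8 KB ... 64 MB

def configurations():
    for l1 in SIZES:
        yield (l1,)
        for l2 in SIZES:
            if l1 * 2 <= l2:
                yield (l1, l2)
                for l3 in SIZES:
                    if l2 * 2 <= l3:
                        yield (l1, l2, l3)

configs = list(configurations())
```

Each tuple would then be expanded with tuned associativity and banking and fed to the cache simulator and CACTI-style energy model.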

  9. Searching Configurations • Optimizing for energy: avg. 5% energy savings, 0.7x performance • Optimizing for energy & performance: avg. 30% energy savings, 1.2x performance • (chart: energy vs. performance per configuration; axis marked "+ performance", "1x speedup", "- performance")

  10. Variations on the Configurations • Restricted configuration space • Power-of-2 sizes, similar to the reference, obtained by clustering • Assuming leakage reduction • Leakage is a significant fraction of power/energy [4,7,8] • Scenarios: no leakage, 10%, 50%, and full leakage

  11. Automatic Selection • Use application characterization to drive selection • Reuse distance • Architecture-independent metric: the number of unique addresses between references to the same address [9]
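The reuse-distance metric just defined can be computed naively from an address trace. This is a straightforward O(N*M) sketch of the definition, not the stack-based algorithms used in practice [9]; cold (first) references get an infinite distance.

```python
# Reuse distance per the definition above: the number of distinct
# addresses touched between consecutive references to the same address.
def reuse_distances(addresses):
    last_seen = {}   # address -> index of its previous reference
    dists = []
    for i, addr in enumerate(addresses):
        if addr in last_seen:
            between = set(addresses[last_seen[addr] + 1 : i])
            dists.append(len(between))
        else:
            dists.append(float("inf"))  # cold reference
        last_seen[addr] = i
    return dists

print(reuse_distances(["a", "b", "c", "a", "a"]))
# [inf, inf, inf, 2, 0]
```

The histogram of these distances predicts hit rates for any LRU cache size, which is what makes the metric architecture-independent and useful for driving configuration selection.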

  12. ADAMANT • Tools to profile the locality of data objects and optimize data movement [2] • Understand the impact of new memory technologies on workloads • Explore optimizations and configurations • Code refactoring, system configuration • Placement of data objects in memory • Track memory usage • Data placement • Data access patterns

  13. ADAMANT View • ADAMANT Characterization and Object View • Read symbols from the application and libraries • Intercept dynamic memory allocation • Fill the object database with events from characterization modules • PEBIL • Captures the address stream • The simulator generates cache events • Examples • Events used to model the performance of different memory configurations
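Conceptually, attributing cache events to objects means mapping each referenced address to the allocation interval it falls in. The sketch below shows that lookup with intercepted malloc events; the object names, addresses, and sizes are invented for illustration, and frees are omitted for brevity (the real tool works via binary instrumentation, not a Python shim).

```python
import bisect

# Map addresses to allocated objects via sorted allocation intervals.
class ObjectMap:
    def __init__(self):
        self.starts, self.ends, self.names = [], [], []

    def on_malloc(self, base, size, name):
        # Keep the three lists aligned and sorted by base address.
        i = bisect.bisect_left(self.starts, base)
        self.starts.insert(i, base)
        self.ends.insert(i, base + size)
        self.names.insert(i, name)

    def lookup(self, addr):
        # Find the rightmost interval starting at or before addr.
        i = bisect.bisect_right(self.starts, addr) - 1
        if i >= 0 and addr < self.ends[i]:
            return self.names[i]
        return None   # not in any tracked object (e.g. stack, globals)

objs = ObjectMap()
objs.on_malloc(0x1000, 256, "fields_")
objs.on_malloc(0x2000, 64, "queue")
print(objs.lookup(0x1010))  # fields_
print(objs.lookup(0x3000))  # None
```

Counting hits, misses, reads, and writes per returned name yields exactly the per-object statistics reported in the case studies below.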

  14. Case Study: NPB-BT • Block Tridiagonal Solver • The stack accounts for a large number of references • Hit rates are very high; caches filter most references • The fields_ variable accounts for most of the memory footprint and references • Hit rates are relatively low; target of almost all DRAM accesses • Example: 2.13x slowdown with fields_ in NVM, 2.02x slowdown with an L4

  15. Case Study: Graph500 • Representative of a class of data-intensive problems • E.g. social networks, bioinformatics • BFS kernel • Identify dynamically allocated objects • Even dynamically allocated objects can be named • Programmer-assisted step • Match an allocation in code with an object • Stack trace • Size
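The programmer-assisted naming step above can be pictured as a lookup table keyed by allocation site and size. Everything in this sketch is invented for illustration (call-site strings, sizes, and the fallback naming scheme); only the idea of matching on stack trace plus size comes from the slides.

```python
# Illustrative programmer-supplied table: an allocation is matched to a
# named object by where it was allocated and how large it is.
ALLOC_NAMES = {
    # (call site, size in bytes) -> object name; entries are made up
    ("make_graph:42", 84_000_000): "edgemem",
    ("bfs:17", 1_000_000): "pred",
}

def name_allocation(call_site, size):
    # Fall back to an anonymous name tagged with the call site.
    return ALLOC_NAMES.get((call_site, size), f"anon@{call_site}")

print(name_allocation("make_graph:42", 84_000_000))  # edgemem
print(name_allocation("hash_init:9", 4096))          # anon@hash_init:9
```

Once named, transient heap objects such as the BFS queues can be analyzed (and placed in memory) just like statically named arrays.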

  16. Case Study: Graph500 • Analysis of object view • edgemem: 84% of footprint, 88% of reads to memory, no writes that reach memory • queues and pred: 99% of writes to memory, small footprint • Example: 1.10x slowdown with all NVM, 1.06x with queues in DRAM, 1.09x with an L4

  17. Case Study: Velvet • De-novo genomic assembler for short-read sequencing

  18. Case Study: Velvet • Distribution of capacity and references (chart, annotated 95% and 99%)

  19. Case Study: Velvet • Hit rates and load vs. store (chart; 96.3% of memory read and write operations)

  20. Case Study: Velvet • Analysis of object view • Large bins and the memory-allocation layer obscure object characteristics • Allocation blocks: 95.6% of capacity, 96.3% of memory reads and writes • Example allocation: 2.56x slowdown with all NVM, 2.53x with blocks in NVM, 2.38x with an L4

  21. Case Study: Velvet • Dividing arrays by usage • Some arrays are in low-reference bins, and the hit rate varies significantly by array • Example allocation: 2.56x slowdown with all NVM, 2.53x with blocks in NVM, 2.38x with an L4

  22. Conclusions • Data movement is critical for performance and energy/power efficiency • Investigating and developing tools to characterize and understand data movement • Researching and modeling emerging technologies and future architectures • Is this research relevant and applicable to dark silicon? • Helps in understanding the implications for locality • Model the impact of dark silicon on data movement and applications • Tune configurations and resource usage

  23. Questions

References
[1] P. Cicotti, L. Carrington, A. Chien, "Toward Application-Specific Memory Reconfiguration for Energy Efficiency", Int. Workshop on Energy Efficient Supercomputing, 2013
[2] P. Cicotti, L. Carrington, "ADAMANT: Tools to Capture, Analyze, and Manage Data Movement", submitted to the Int. Conference on Computational Science
[3] A. Suresh, P. Cicotti, L. Carrington, "Evaluation of Emerging Memory Technologies for HPC, Data Intensive Applications", Int. Conference on Cluster Computing (Cluster), 2014
[4] P. Kogge et al., "ExaScale Computing Study: Technology Challenges in Achieving Exascale Systems", CSE Dept. Tech. Report TR-2008-13, 2008
[5] S. Thoziyoor, J.H. Ahn, M. Monchiero, J.B. Brockman, N.P. Jouppi, "A Comprehensive Memory Modeling Tool and Its Application to the Design and Analysis of Future Memory Hierarchies", 35th Int. Symposium on Computer Architecture (ISCA), 2008
[6] T. Vogelsang, "Understanding the Energy Consumption of Dynamic Random Access Memories", 43rd Int. Symposium on Microarchitecture (MICRO), 2010
[7] N.S. Kim, K. Flautner, D. Blaauw, T. Mudge, "Single-VDD and Single-VT Super-Drowsy Techniques for Low-Leakage High-Performance Instruction Caches", Int. Symposium on Low Power Electronics and Design (ISLPED), 2004
[8] Z. Hu, S. Kaxiras, M. Martonosi, "Let Caches Decay: Reducing Leakage Energy via Exploitation of Cache Generational Behavior", ACM Trans. on Computer Systems (TOCS), vol. 20, 2002
[9] R.L. Mattson et al., "Evaluation Techniques for Storage Hierarchies", IBM Systems Journal, vol. 9, 1970

Acknowledgments: NSF Award OCI-10-1-57921, DARPA HR0011-13-2-0014, DoE ASCR "Thrifty: An Exascale Architecture for Energy-Proportional Computing", XSEDE OCI-1053575, NSF Award CCF-1451598, Intel
