Data movement in Dark Silicon Systems Pietro Cicotti UCSD/SDSC, Performance Modeling and Characterization lab (PMaC) Dagstuhl Seminar, Dark Silicon: from Embedded to HPC Systems, February 1st 2016
Locality changes with Dark Silicon • Power on phase • New resources become available • Memory coming online is empty • Processing resources may need to access or fetch data • Power off phase • Updates must be flushed • On/Off cycles imply data movement and changes in locality
Example: Computational Sprinting • A number of dark cores are activated and participate in a computation • HPC view: threads are spawned (or resumed) and occupy the new resources, working collaboratively • Is data shared? Read-only? Streamed or reused? • Is the architecture homogeneous? Do dark cores have the same caches? • What determines the duration of the sprint? • The thermal capacity available is independent of computational phases • Locality changes with the resources utilized
Example: Invasive Computing • More flexibility than sprinting • The application/runtime system is in charge • It requests, uses, and releases resources • Different kinds of resources • Compute, store, communicate • Data moves in and out while invading and retreating • Requires understanding of the data access pattern to devise the best invasion strategy • Interaction with the system is required to match goals with the resources available
PMaC Work on Modeling Tools and Models • Develop tools and models • Understand the application's behavior • Binary instrumentation, HW counters • Understand the system's behavior • HW counters, simulation • Models • Application + system characterization to estimate performance/efficiency • Examples: • Model systems and emerging technology, devise optimizations
Configurable Memory Hierarchy Exascale report, Peter Kogge, 2008 [4] If the memory hierarchy can be partially powered down: • Idea: increase efficiency by customization • Accelerators, reconfigurable hardware, heterogeneous systems, etc. • The focus of this work is on-chip memory • E.g., 20% of CPU power draw • Configure caches • Power cache levels on/off • Resize levels (power banks on/off) • How to select the optimal configuration?
Workload Characterization Locality characterization • Analyze address stream • Binary instrumentation • Cache simulations Benchmarks • 37 benchmarks • HPC, DoD, DoE, bio, and data mining
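The cache simulations mentioned above can be sketched as a simple trace-driven model: replay the captured address stream through an LRU set-associative cache and count hits and misses. This is an illustrative sketch, not the study's simulator; the size, line, and associativity values here are stand-in defaults.

```python
from collections import OrderedDict

def simulate_cache(addresses, size=32 * 1024, line=64, ways=8):
    """Replay an address stream through one LRU set-associative cache level.
    Defaults (32KB, 64B lines, 8-way) are illustrative, not the study's."""
    sets = size // (line * ways)
    cache = [OrderedDict() for _ in range(sets)]  # per-set LRU order
    hits = misses = 0
    for addr in addresses:
        block = addr // line
        s = cache[block % sets]
        if block in s:
            s.move_to_end(block)        # refresh LRU position
            hits += 1
        else:
            misses += 1
            s[block] = True
            if len(s) > ways:
                s.popitem(last=False)   # evict least recently used block
    return hits, misses
```

Sweeping such a simulator over many (size, line, ways) tuples per address stream is what makes the configuration-space search below tractable without hardware.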
Hardware Configurations (Figure: configuration space along L1, L2, and L3 size) A space of 2652 configurations searched • Dimensions: • Size: 8KB-64MB • 1-3 levels • size(Ln) × 2 ≤ size(Ln+1) • Parameters tuned for each size • Associativity, banking, etc., approximate current processors • Dynamic energy, leakage, and latency estimated with CACTI [5] • DRAM • 40pJ/bit, 400-cycle latency [6] • Reference: 32KB L1, 256KB L2, 2.5MB L3/core • Estimate performance and energy variations using binary instrumentation and cache simulations
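The size dimension of this search space can be enumerated directly. The sketch below generates every 1- to 3-level hierarchy of power-of-two capacities from 8KB to 64MB obeying size(Ln) × 2 ≤ size(Ln+1); it covers sizes only, so it does not reproduce the full 2652-point space, which also varies the per-size tuned parameters (associativity, banking, etc.).

```python
# Power-of-two capacities from 8KB (2^13) to 64MB (2^26): 14 sizes.
SIZES = [8 * 1024 * 2**i for i in range(14)]

def configurations(max_levels=3):
    """Enumerate 1..max_levels-level hierarchies where each level is at
    least twice the size of the level below it (size-only sketch)."""
    out = []
    def grow(prefix):
        if prefix:
            out.append(tuple(prefix))
        if len(prefix) < max_levels:
            floor = 2 * prefix[-1] if prefix else 0
            for s in SIZES:
                if s >= floor:
                    grow(prefix + [s])
    grow([])
    return out
```

With 14 sizes the constraint reduces to choosing strictly increasing size indices, giving 14 + C(14,2) + C(14,3) = 469 size combinations before per-size parameter tuning multiplies the space.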
Searching Configurations • Optimization target: energy • Avg. 5% energy, 0.7x performance • Optimization target: energy & performance • Avg. 30% energy, 1.2x performance (Figure: energy vs. performance per configuration; the 1x speedup line separates + performance from - performance)
Variations on the Configurations • Restricted configuration space • Power-of-2 sizes, similar to the reference, obtained by clustering Assuming leakage reduction • Leakage is a significant fraction of power/energy [4,7,8] • Scenarios: full leakage, 50%, 10%, and no leakage
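The leakage scenarios above can be read as a scale factor in a toy energy model: dynamic energy per access plus leakage power integrated over the run. This is a minimal sketch with invented inputs, not the CACTI-derived model used in the study.

```python
def total_energy_j(accesses, e_dyn_pj, runtime_s, p_leak_w, leak_scale=1.0):
    """Toy energy model in joules: accesses * dynamic energy per access
    (picojoules) + runtime * leakage power (watts), scaled by leak_scale.
    leak_scale = 1.0 / 0.5 / 0.1 / 0.0 mirrors the full / 50% / 10% /
    no-leakage scenarios above. All inputs here are illustrative."""
    dynamic = accesses * e_dyn_pj * 1e-12   # pJ -> J
    leakage = runtime_s * p_leak_w * leak_scale
    return dynamic + leakage
```

Sweeping leak_scale shows directly how much of a configuration's energy advantage survives once leakage is engineered away, which is what the restricted-space comparison probes.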
Automatic Selection Use application characterization to drive selection • Reuse distance • Architecture-independent metric: the number of unique addresses between two references to the same address [9]
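The reuse-distance metric can be computed naively as follows: for each reference, count the unique addresses touched since the previous reference to the same address. This O(N·M) sketch illustrates the definition from [9]; production tools use tree-based algorithms to keep it near O(N log M).

```python
def reuse_distances(stream):
    """Per-reference reuse (stack) distance: unique addresses between two
    references to the same address; inf marks a cold (first) reference."""
    last_seen = {}   # address -> index of its most recent reference
    distances = []
    for i, addr in enumerate(stream):
        if addr in last_seen:
            # Unique addresses touched strictly between the two references.
            distances.append(len(set(stream[last_seen[addr] + 1:i])))
        else:
            distances.append(float("inf"))
        last_seen[addr] = i
    return distances
```

A histogram of these distances predicts the hit rate of any fully associative LRU cache of a given capacity, which is why the metric can drive configuration selection without re-simulating every cache size.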
ADAMANT ADAMANT: tools to profile locality of data objects and optimize data movement [2] • Understand impact of new memory technology on workloads • Explore optimizations and configurations • Code refactoring, system configuration • Placement of data objects in memory • Track memory usage • Data placement • Data access patterns
ADAMANT View • ADAMANT Characterization and Object View • Read symbols from application and libraries • Intercept dynamic memory allocation • Fill object database with events from characterization modules • PEBIL • Capture address stream • Simulator generates cache events • Examples • Events used to model performance of different memory configurations
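Attributing each address in the stream to the data object that owns it reduces to an interval lookup over recorded allocations. The sketch below is a toy version of this object-view step, assuming allocations have already been intercepted; the object names used are invented for illustration.

```python
import bisect

class ObjectMap:
    """Toy object view: record allocations (as if intercepted from the
    dynamic memory allocator) and attribute addresses to owning objects."""

    def __init__(self):
        self.starts = []   # sorted base addresses
        self.objs = []     # (name, base, size), parallel to self.starts

    def allocate(self, name, base, size):
        i = bisect.bisect(self.starts, base)
        self.starts.insert(i, base)
        self.objs.insert(i, (name, base, size))

    def owner(self, addr):
        """Return the name of the object containing addr, or None."""
        i = bisect.bisect(self.starts, addr) - 1
        if i >= 0:
            name, base, size = self.objs[i]
            if base <= addr < base + size:
                return name
        return None
```

Feeding every simulated cache event through owner() is what turns a flat miss count into per-object statistics like those in the case studies that follow.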
Case Study: NPB-BT • Block Tridiagonal solver • The stack accounts for a large number of references • Hit rates are very high; caches filter most references • The fields_ variable has most of the memory footprint and references • Hit rates are relatively low; it is the target of almost all DRAM accesses • Example: 2.13x slowdown with fields_ in NVM, 2.02x slowdown with an L4 cache
Case Study: Graph500 Representative of a class of data-intensive problems • E.g., social networks, bioinformatics • BFS kernel Identify dynamically allocated objects • Even dynamically allocated objects can be named • A programmer-assisted step • Match the allocation in the code with the object • Stack trace • Size
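The programmer-assisted naming step can be approximated by keying each allocation on its call site and size, so repeated allocations of the same logical object collapse to one entry. This is a hedged sketch of the idea, not ADAMANT's implementation; the function and object names below are invented.

```python
import traceback

def record_allocation(db, size):
    """Key an allocation by (calling function, size) so that the same
    logical object is recognized across allocations. Sketch only: real
    tools match the full stack trace, not just the immediate caller."""
    caller = traceback.extract_stack()[-2]  # frame that requested memory
    key = (caller.name, size)
    db[key] = db.get(key, 0) + 1            # count allocations per object
    return key
```

With such keys in place, statistics like "edgemem: 84% of the footprint" become queries over the object database rather than raw addresses.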
Case Study: Graph500 • Analysis of the object view • edgemem: 84% of the footprint, 88% of reads to memory, no writes that reach memory • queues and pred: 99% of writes to memory, little footprint • Example: 1.10x slowdown with all objects in NVM, 1.06x with queues in DRAM, 1.09x slowdown with an L4 cache
Case Study: Velvet • De-novo genomic assembler for short-read sequencing
Case Study: Velvet • Distribution of capacity and references (chart; 95% and 99% marked)
Case Study: Velvet • Hit rates and load vs. store (chart; 96.3% of memory read and write operations marked)
Case Study: Velvet Analysis of the object view • Large bins and the memory-allocation layer obfuscate object characteristics • Allocation blocks: 95.6% of capacity, 96.3% of memory reads and writes • Example: 2.56x slowdown with all objects in NVM, 2.53x with blocks in NVM, 2.38x with an L4 cache
Case Study: Velvet Dividing arrays by usage • Some arrays fall in low-reference bins, and the hit rate varies significantly by array • Example: 2.56x slowdown with all objects in NVM, 2.53x with blocks in NVM, 2.38x with an L4 cache
Conclusions Data movement is critical for performance and energy/power efficiency • Investigating and developing tools to characterize and understand data movement • Researching and modeling emerging technologies and future architectures Is this research relevant and applicable to dark silicon? • It can help in understanding the implications for locality • Model the impact of dark silicon on data movement and applications • Tune configurations and resource usage
Questions
References
[1] P. Cicotti, L. Carrington, A. Chien, "Toward Application-Specific Memory Reconfiguration for Energy Efficiency", Int. Workshop on Energy Efficient Supercomputing, 2013
[2] P. Cicotti, L. Carrington, "ADAMANT: tools to capture, analyze, and manage data movement", submitted to the Int. Conference on Computational Science
[3] A. Suresh, P. Cicotti, L. Carrington, "Evaluation of emerging memory technologies for HPC, data intensive applications", Int. Conference on Cluster Computing (Cluster), 2014
[4] P. Kogge et al., "ExaScale Computing Study: Technology Challenges in Achieving Exascale Systems", CSE Dept. Tech. Report TR-2008-13, 2008
[5] S. Thoziyoor, J.H. Ahn, M. Monchiero, J.B. Brockman, N.P. Jouppi, "A Comprehensive Memory Modeling Tool and Its Application to the Design and Analysis of Future Memory Hierarchies", 35th Annual International Symposium on Computer Architecture (ISCA), 2008
[6] T. Vogelsang, "Understanding the Energy Consumption of Dynamic Random Access Memories", 43rd Annual ACM/IEEE International Symposium on Microarchitecture (MICRO), 2010
[7] N.S. Kim, K. Flautner, D. Blaauw, T. Mudge, "Single-VDD and single-VT super-drowsy techniques for low-leakage high-performance instruction caches", Int. Symposium on Low Power Electronics and Design (ISLPED), 2004
[8] Z. Hu, S. Kaxiras, M. Martonosi, "Let caches decay: reducing leakage energy via exploitation of cache generational behavior", ACM Trans. on Computer Systems (TOCS), vol. 20, 2002
[9] R.L. Mattson et al., "Evaluation techniques for storage hierarchies", IBM Systems Journal, vol. 9, 1970
Acknowledgments NSF Award OCI-10-1-57921, DARPA HR0011-13-2-0014, DoE ASCR: Thrifty: an exascale architecture for energy-proportional computing, XSEDE OCI-1053575, NSF Award CCF-1451598, Intel