Data movement in Dark Silicon Systems Pietro Cicotti UCSD/SDSC, Performance Modeling and Characterization lab (PMaC) Dagstuhl Seminar, Dark Silicon: from Embedded to HPC Systems, February 1st 2016
Locality changes with Dark Silicon • Power on phase • New resources become available • Memory coming online is empty • Processing resources may need to access or fetch data • Power off phase • Updates must be flushed • On/Off cycles imply data movement and changes in locality
Example: Computational Sprinting • A number of dark cores are activated and participate in a computation • HPC view: threads are spawned (or resumed) and occupy the new resources, working collaboratively • Is data shared? Read-only? Streamed or reused? • Is the architecture homogeneous? Do dark cores have the same caches? • What determines the duration of the sprint? • The thermal capacity available is independent of computational phases • Locality changes with the resources utilized
Example: Invasive Computing • More flexibility than sprinting • The application/runtime system is in charge • It requests, uses, and releases resources • Different kinds of resources • Compute, store, communicate • Data moves in and out while invading and retreating • Requires understanding of the data access pattern to devise the best invasion strategy • Interaction with the system is required to match goals with the resources available
PMaC Work on Modeling Tools and Models • Develop tools and models • Understand the application's behavior • Binary instrumentation, HW counters • Understand the system's behavior • HW counters, simulation • Models • Application + system characterization to estimate performance/efficiency • Examples: • Model systems and emerging technology, devise optimizations
Configurable Memory Hierarchy Exascale report, Peter Kogge, 2008 [4] If the memory hierarchy can be partially powered down: • Idea: increase efficiency by customization • Accelerators, reconfigurable hardware, heterogeneous systems, etc. • The focus of this work is on-chip memory • E.g., 20% of CPU power draw • Configure caches • Power cache levels on/off • Resize levels (power banks on/off) • How to select the optimal configuration?
Workload Characterization Locality characterization • Analyze address stream • Binary instrumentation • Cache simulations Benchmarks • 37 benchmarks • HPC, DoD, DoE, bio, and data mining
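The cache simulations mentioned above can be sketched as a simple trace-driven model: replay the captured address stream through an LRU set-associative cache and count hits and misses. This is an illustrative sketch, not the study's simulator; the size, line, and associativity values here are stand-in defaults.

```python
from collections import OrderedDict

def simulate_cache(addresses, size=32 * 1024, line=64, ways=8):
    """Replay an address stream through one LRU set-associative cache level.
    Defaults (32KB, 64B lines, 8-way) are illustrative, not the study's."""
    sets = size // (line * ways)
    cache = [OrderedDict() for _ in range(sets)]  # per-set LRU order
    hits = misses = 0
    for addr in addresses:
        block = addr // line
        s = cache[block % sets]
        if block in s:
            s.move_to_end(block)        # refresh LRU position
            hits += 1
        else:
            misses += 1
            s[block] = True
            if len(s) > ways:
                s.popitem(last=False)   # evict least recently used block
    return hits, misses
```

Sweeping such a simulator over many (size, line, ways) tuples per address stream is what makes the configuration-space search below tractable without hardware.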
Hardware Configurations (Figure: configuration space along L1, L2, and L3 size) A space of 2652 configurations searched • Dimensions: • Size: 8KB-64MB • 1-3 levels • size(Ln) × 2 ≤ size(Ln+1) • Parameters tuned for each size • Associativity, banking, etc., approximate current processors • Dynamic energy, leakage, and latency estimated with CACTI [5] • DRAM • 40pJ/bit, 400-cycle latency [6] • Reference: 32KB L1, 256KB L2, 2.5MB L3/core • Estimate performance and energy variations using binary instrumentation and cache simulations
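The size dimension of this search space can be enumerated directly. The sketch below generates every 1- to 3-level hierarchy of power-of-two capacities from 8KB to 64MB obeying size(Ln) × 2 ≤ size(Ln+1); it covers sizes only, so it does not reproduce the full 2652-point space, which also varies the per-size tuned parameters (associativity, banking, etc.).

```python
# Power-of-two capacities from 8KB (2^13) to 64MB (2^26): 14 sizes.
SIZES = [8 * 1024 * 2**i for i in range(14)]

def configurations(max_levels=3):
    """Enumerate 1..max_levels-level hierarchies where each level is at
    least twice the size of the level below it (size-only sketch)."""
    out = []
    def grow(prefix):
        if prefix:
            out.append(tuple(prefix))
        if len(prefix) < max_levels:
            floor = 2 * prefix[-1] if prefix else 0
            for s in SIZES:
                if s >= floor:
                    grow(prefix + [s])
    grow([])
    return out
```

With 14 sizes the constraint reduces to choosing strictly increasing size indices, giving 14 + C(14,2) + C(14,3) = 469 size combinations before per-size parameter tuning multiplies the space.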
Searching Configurations • Optimization target: energy • Avg. 5% energy, 0.7x performance • Optimization target: energy & performance • Avg. 30% energy, 1.2x performance (Figure: energy vs. performance per configuration; the 1x speedup line separates + performance from - performance)
Variations on the Configurations • Restricted configuration space • Power-of-2 sizes, similar to the reference, obtained by clustering Assuming leakage reduction • Leakage is a significant fraction of power/energy [4,7,8] • Scenarios: full leakage, 50%, 10%, and no leakage
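The leakage scenarios above can be read as a scale factor in a toy energy model: dynamic energy per access plus leakage power integrated over the run. This is a minimal sketch with invented inputs, not the CACTI-derived model used in the study.

```python
def total_energy_j(accesses, e_dyn_pj, runtime_s, p_leak_w, leak_scale=1.0):
    """Toy energy model in joules: accesses * dynamic energy per access
    (picojoules) + runtime * leakage power (watts), scaled by leak_scale.
    leak_scale = 1.0 / 0.5 / 0.1 / 0.0 mirrors the full / 50% / 10% /
    no-leakage scenarios above. All inputs here are illustrative."""
    dynamic = accesses * e_dyn_pj * 1e-12   # pJ -> J
    leakage = runtime_s * p_leak_w * leak_scale
    return dynamic + leakage
```

Sweeping leak_scale shows directly how much of a configuration's energy advantage survives once leakage is engineered away, which is what the restricted-space comparison probes.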
Automatic Selection Use application characterization to drive selection • Reuse distance • Architecture-independent metric: the number of unique addresses between two references to the same address [9]
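The reuse-distance metric can be computed naively as follows: for each reference, count the unique addresses touched since the previous reference to the same address. This O(N·M) sketch illustrates the definition from [9]; production tools use tree-based algorithms to keep it near O(N log M).

```python
def reuse_distances(stream):
    """Per-reference reuse (stack) distance: unique addresses between two
    references to the same address; inf marks a cold (first) reference."""
    last_seen = {}   # address -> index of its most recent reference
    distances = []
    for i, addr in enumerate(stream):
        if addr in last_seen:
            # Unique addresses touched strictly between the two references.
            distances.append(len(set(stream[last_seen[addr] + 1:i])))
        else:
            distances.append(float("inf"))
        last_seen[addr] = i
    return distances
```

A histogram of these distances predicts the hit rate of any fully associative LRU cache of a given capacity, which is why the metric can drive configuration selection without re-simulating every cache size.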
ADAMANT ADAMANT: tools to profile locality of data objects and optimize data movement [2] • Understand impact of new memory technology on workloads • Explore optimizations and configurations • Code refactoring, system configuration • Placement of data objects in memory • Track memory usage • Data placement • Data access patterns
ADAMANT View • ADAMANT Characterization and Object View • Read symbols from application and libraries • Intercept dynamic memory allocation • Fill object database with events from characterization modules • PEBIL • Capture address stream • Simulator generates cache events • Examples • Events used to model performance of different memory configurations
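Attributing each address in the stream to the data object that owns it reduces to an interval lookup over recorded allocations. The sketch below is a toy version of this object-view step, assuming allocations have already been intercepted; the object names used are invented for illustration.

```python
import bisect

class ObjectMap:
    """Toy object view: record allocations (as if intercepted from the
    dynamic memory allocator) and attribute addresses to owning objects."""

    def __init__(self):
        self.starts = []   # sorted base addresses
        self.objs = []     # (name, base, size), parallel to self.starts

    def allocate(self, name, base, size):
        i = bisect.bisect(self.starts, base)
        self.starts.insert(i, base)
        self.objs.insert(i, (name, base, size))

    def owner(self, addr):
        """Return the name of the object containing addr, or None."""
        i = bisect.bisect(self.starts, addr) - 1
        if i >= 0:
            name, base, size = self.objs[i]
            if base <= addr < base + size:
                return name
        return None
```

Feeding every simulated cache event through owner() is what turns a flat miss count into per-object statistics like those in the case studies that follow.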
Case Study: NPB-BT • Block Tridiagonal solver • The stack accounts for a large number of references • Hit rates are very high; caches filter most references • The fields_ variable has most of the memory footprint and references • Hit rates are relatively low; it is the target of almost all DRAM accesses • Example: 2.13x slowdown with fields_ in NVM, 2.02x slowdown with an L4 cache
Case Study: Graph500 Representative of a class of data-intensive problems • E.g., social networks, bioinformatics • BFS kernel Identify dynamically allocated objects • Even dynamically allocated objects can be named • A programmer-assisted step • Match the allocation in the code with the object • Stack trace • Size
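The programmer-assisted naming step can be approximated by keying each allocation on its call site and size, so repeated allocations of the same logical object collapse to one entry. This is a hedged sketch of the idea, not ADAMANT's implementation; the function and object names below are invented.

```python
import traceback

def record_allocation(db, size):
    """Key an allocation by (calling function, size) so that the same
    logical object is recognized across allocations. Sketch only: real
    tools match the full stack trace, not just the immediate caller."""
    caller = traceback.extract_stack()[-2]  # frame that requested memory
    key = (caller.name, size)
    db[key] = db.get(key, 0) + 1            # count allocations per object
    return key
```

With such keys in place, statistics like "edgemem: 84% of the footprint" become queries over the object database rather than raw addresses.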
Case Study: Graph500 • Analysis of the object view • edgemem: 84% of the footprint, 88% of reads to memory, no writes that reach memory • queues and pred: 99% of writes to memory, little footprint • Example: 1.10x slowdown with all objects in NVM, 1.06x with queues in DRAM, 1.09x slowdown with an L4 cache
Case Study: Velvet • De-novo genomic assembler for short-read sequencing
Case Study: Velvet • Distribution of capacity and references (chart; 95% and 99% marked)
Case Study: Velvet • Hit rates and load vs. store (chart; 96.3% of memory read and write operations marked)
Case Study: Velvet Analysis of the object view • Large bins and the memory-allocation layer obfuscate object characteristics • Allocation blocks: 95.6% of capacity, 96.3% of memory reads and writes • Example: 2.56x slowdown with all objects in NVM, 2.53x with blocks in NVM, 2.38x with an L4 cache
Case Study: Velvet Dividing arrays by usage • Some arrays fall in low-reference bins, and the hit rate varies significantly by array • Example: 2.56x slowdown with all objects in NVM, 2.53x with blocks in NVM, 2.38x with an L4 cache
Conclusions Data movement is critical for performance and energy/power efficiency • Investigating and developing tools to characterize and understand data movement • Researching and modeling emerging technologies and future architectures Is this research relevant and applicable to dark silicon? • It can help in understanding the implications for locality • Model the impact of dark silicon on data movement and applications • Tune configurations and resource usage
Questions
References
[1] P. Cicotti, L. Carrington, A. Chien, "Toward Application-Specific Memory Reconfiguration for Energy Efficiency", Int. Workshop on Energy Efficient Supercomputing, 2013
[2] P. Cicotti, L. Carrington, "ADAMANT: tools to capture, analyze, and manage data movement", submitted to the Int. Conference on Computational Science
[3] A. Suresh, P. Cicotti, L. Carrington, "Evaluation of emerging memory technologies for HPC, data intensive applications", Int. Conference on Cluster Computing (Cluster), 2014
[4] P. Kogge et al., "ExaScale Computing Study: Technology Challenges in Achieving Exascale Systems", CSE Dept. Tech. Report TR-2008-13, 2008
[5] S. Thoziyoor, J.H. Ahn, M. Monchiero, J.B. Brockman, N.P. Jouppi, "A Comprehensive Memory Modeling Tool and Its Application to the Design and Analysis of Future Memory Hierarchies", 35th Annual International Symposium on Computer Architecture (ISCA), 2008
[6] T. Vogelsang, "Understanding the Energy Consumption of Dynamic Random Access Memories", 43rd Annual ACM/IEEE International Symposium on Microarchitecture (MICRO), 2010
[7] N.S. Kim, K. Flautner, D. Blaauw, T. Mudge, "Single-VDD and single-VT super-drowsy techniques for low-leakage high-performance instruction caches", Int. Symposium on Low Power Electronics and Design (ISLPED), 2004
[8] Z. Hu, S. Kaxiras, M. Martonosi, "Let caches decay: reducing leakage energy via exploitation of cache generational behavior", ACM Trans. on Computer Systems (TOCS), vol. 20, 2002
[9] R.L. Mattson et al., "Evaluation techniques for storage hierarchies", IBM Systems Journal, vol. 9, 1970
Acknowledgments NSF Award OCI-10-1-57921, DARPA HR0011-13-2-0014, DoE ASCR: Thrifty: an exascale architecture for energy-proportional computing, XSEDE OCI-1053575, NSF Award CCF-1451598, Intel