Klaus Waldschmidt J. W. Goethe-University Technische Informatik Frankfurt am Main, Germany

Reliability-Aware Power Management Of Multi-Core Systems (MPSOCs)
Klaus Waldschmidt, J. W. Goethe-University, Technische Informatik, Frankfurt am Main, Germany
waldsch@ti.informatik.uni-frankfurt.de

Agenda
• Multi-core embedded systems and multi-core platforms in the future
• Reliability and power management
• Problems:
  • Performance: algorithms, programming model
  • Power management: energy reduction
  • Reliability: increase of lifespan and robustness
[Figure labels: digital hardware (reconfigurable), analog hardware, software; Reliability, Performance, Power Management]

Power Management
• Static and dynamic power management
• Dynamic power management:
  • Reacts dynamically to workload variation
  • Scales the power consumption of the system and/or system parts with
    • Frequency scaling
    • Dynamic voltage scaling (dynamic power)
    • Adaptive body biasing (leakage power)
    • Clock gating
    • Supply shutdown
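The effect of frequency scaling and dynamic voltage scaling on dynamic power can be illustrated with the standard CMOS switching-power relation P = alpha * C * V^2 * f. The sketch below is only an illustration: the activity factor, capacitance, and operating points are hypothetical values that do not come from the slides, and leakage power (the target of adaptive body biasing) is not modeled.

```python
def dynamic_power(activity: float, capacitance_f: float,
                  voltage_v: float, frequency_hz: float) -> float:
    """Dynamic (switching) power of CMOS logic: P = alpha * C * V^2 * f."""
    return activity * capacitance_f * voltage_v ** 2 * frequency_hz


# Hypothetical parameters, chosen only for illustration.
ALPHA = 0.2    # average switching activity (assumed)
C_EFF = 1e-9   # effective switched capacitance in farads (assumed)

# Full speed, frequency scaling alone, and frequency plus voltage scaling.
p_high = dynamic_power(ALPHA, C_EFF, voltage_v=1.2, frequency_hz=1.0e9)
p_freq_only = dynamic_power(ALPHA, C_EFF, voltage_v=1.2, frequency_hz=0.5e9)
p_dvs = dynamic_power(ALPHA, C_EFF, voltage_v=0.9, frequency_hz=0.5e9)

print(f"full speed:              {p_high:.3f} W")
print(f"half frequency:          {p_freq_only:.3f} W")
print(f"half frequency, 0.9 V:   {p_dvs:.3f} W")
```

Halving the frequency alone halves the dynamic power, while lowering the supply voltage at the same time reduces it further because the voltage enters quadratically.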

Power Management and Reliability
• The reliability of a digital system is affected by power management in two ways:
  • It tends to lower the system’s temperature → reliability increases
  • It introduces thermal cycling → reliability decreases
• De Micheli et al. investigated the effects of power management on the long-term reliability of microprocessors
• Simulations of power-managed and non-power-managed cores with small feature size show a decline in reliability for power-managed systems
→ Reliability-aware power management
From: K. Mihic, T. Simunic, G. de Micheli: "Reliability and Power management of integrated systems", DSD ‘04

Power Management and Dynamic Parallelism
• Power management for multicore systems is more sophisticated than for single cores:
  • The required performance depends on the parallelizability of the task(s) running on the system
  • To reduce power consumption of the system, cores can be
    • put to lower frequency modes or
    • put to sleep mode or
    • switched off
  • To increase the performance of the system, cores can be
    • put to higher frequency modes or
    • woken up from sleep mode or
    • switched on
• A system which is able to control its performance and power consumption according to the parallelizability of tasks has to support
  • dynamic workload distribution
  • dynamic adding and removing of cores
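One common way to quantify how the required number of active cores depends on parallelizability is Amdahl's law. The slides do not name it, so the sketch below is an assumption used purely for illustration; the function names `amdahl_speedup` and `cores_needed` are hypothetical.

```python
from typing import Optional


def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Speedup over one core when `parallel_fraction` of the work parallelizes."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be in [0, 1]")
    if cores < 1:
        raise ValueError("cores must be >= 1")
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / cores)


def cores_needed(parallel_fraction: float, required_speedup: float,
                 max_cores: int) -> Optional[int]:
    """Smallest core count reaching the required speedup, or None if unreachable."""
    for n in range(1, max_cores + 1):
        if amdahl_speedup(parallel_fraction, n) >= required_speedup:
            return n
    return None


# With four cores available, a highly parallel task can justify waking more
# cores, while a poorly parallel one cannot reach the target at all.
print(cores_needed(0.9, 2.0, max_cores=4))
print(cores_needed(0.5, 3.0, max_cores=4))
```

When the target cannot be reached, waking further cores would only cost power, which is the situation in which a power manager can keep them asleep or switched off.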

The Self Distributing Virtual Machine (SDVM)
• Application to be run on heterogeneous hardware (MPSOCs and reconfigurable HW)
• The SDVM as a middleware between application and hardware
• Application runs transparently distributed on several sites
[Figure labels: Communication, Distribution, Adaptivity, Virtual Machine, # cores, heterogeneity; application, site, SDVM daemon, SDVM, Core A, Hardware A, Core B, Hardware B, network]

The SDVM as a middleware for MPSOCs
• Besides computer clusters and grid computing, the SDVM will also target multicore chips and SOCs in future projects
• Multicore chip:
  • middleware for several processors
  • increase number of sites if needed
• FPGA:
  • use available space on the FPGA
  • implement special functionality on the FPGA
  • reconfigure at runtime
• Legend: HFM: high frequency mode; LFM: low frequency mode; SLP: sleep mode; OFF: off
[Figure labels: multicore chip and FPGA with processors and HW functions, marked with the modes LFM, HFM, SLP, and OFF]

Modeling of Reliability Aware Power Management for Multicores
• We investigated different power management strategies for multicore systems with dynamic workload distribution
• The cores are assumed to offer four different PM-states:
  • HFM (high frequency mode)
  • LFM (low frequency mode)
  • SLEEP
  • OFF
• Three different power management policies were considered:
  • fast-upgrade: tries to optimize performance (represents usual power management)
  • low temperature: tries to minimize temperature
  • smooth temperature: tries to minimize thermal cycling
• The simulations were performed using the SDVM with four cores

The fast upgrade policy
Flowchart decisions:
• average workload > MAX?
• average workload < MIN?
• cores in SLEEP-mode or OFF-mode present?
• cores in HF-mode present?
• cores in LF-mode present which haven’t executed applications for more than T sec.?
• cores in LF-mode present?
Flowchart actions:
• Switch all cores in LF-mode to HF-mode.
• Among those, choose core with highest temperature for transition to LF-mode.
• Among those, choose core with lowest temperature for transition to HF-mode.
• Among those, choose most unengaged core for transition to SLEEP-mode.
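The flowchart of the fast upgrade policy can be read as a single decision step that is evaluated periodically. The sketch below shows one possible reading, not the authors' implementation: the pairing of decisions with actions and the order of the checks are assumptions, "most unengaged" is interpreted as "idle for the longest time", and all names (`Mode`, `Core`, `fast_upgrade_step`) and threshold values are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Mode(Enum):
    HFM = "high frequency mode"
    LFM = "low frequency mode"
    SLEEP = "sleep"
    OFF = "off"


@dataclass
class Core:
    name: str
    mode: Mode
    temperature_c: float
    idle_seconds: float = 0.0  # time since the core last executed an application


def fast_upgrade_step(cores: List[Core], avg_workload: float,
                      max_load: float, min_load: float,
                      idle_limit_s: float) -> None:
    """Apply one decision step of the (assumed) fast upgrade policy in place."""
    if avg_workload > max_load:
        dormant = [c for c in cores if c.mode in (Mode.SLEEP, Mode.OFF)]
        if dormant:
            # Wake the coolest dormant core directly into HF-mode.
            min(dormant, key=lambda c: c.temperature_c).mode = Mode.HFM
        else:
            # No dormant cores left: switch all LF-mode cores to HF-mode.
            for c in cores:
                if c.mode is Mode.LFM:
                    c.mode = Mode.HFM
    elif avg_workload < min_load:
        fast = [c for c in cores if c.mode is Mode.HFM]
        if fast:
            # Slow down the hottest HF-mode core.
            max(fast, key=lambda c: c.temperature_c).mode = Mode.LFM
        else:
            idle = [c for c in cores
                    if c.mode is Mode.LFM and c.idle_seconds > idle_limit_s]
            if idle:
                # Put the longest-idle LF-mode core to sleep.
                max(idle, key=lambda c: c.idle_seconds).mode = Mode.SLEEP


cores = [Core("core1", Mode.HFM, 70.0), Core("core2", Mode.LFM, 55.0),
         Core("core3", Mode.SLEEP, 40.0), Core("core4", Mode.OFF, 35.0)]
fast_upgrade_step(cores, avg_workload=0.95, max_load=0.8, min_load=0.3,
                  idle_limit_s=5.0)
print([(c.name, c.mode.name) for c in cores])
```

In this reading, a high workload brings a dormant core straight to HF-mode, which favors performance over temperature.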

Example run - fast upgrade policy
• One core always in HF-mode → high temperature of core 1
• Maximum temperature 86°C
• The temperature T_J of a core is determined from its power consumption by the formula [formula missing in source]
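A commonly used steady-state relation between power consumption and junction temperature is T_J = T_A + theta_JA * P, with ambient temperature T_A and junction-to-ambient thermal resistance theta_JA. Whether the slides used exactly this form is an assumption, and the numeric values below are hypothetical.

```python
def junction_temperature(power_w: float, ambient_c: float,
                         theta_ja_c_per_w: float) -> float:
    """Steady-state junction temperature: T_J = T_A + theta_JA * P."""
    return ambient_c + theta_ja_c_per_w * power_w


# Hypothetical values: 45 degC ambient and 2 degC/W thermal resistance.
for power_w in (5.0, 10.0, 20.0):
    t_j = junction_temperature(power_w, ambient_c=45.0, theta_ja_c_per_w=2.0)
    print(f"P = {power_w:4.1f} W  ->  T_J = {t_j:.1f} degC")
```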

The low temperature policy
Flowchart decisions:
• average workload > MAX?
• cores in HF-mode with temperature > TEMPMAX present?
• average workload < MIN?
• cores in SLEEP-mode or OFF-mode present?
• cores in HF-mode present?
• average workload > MAX2 for more than T sec. and cores in LF-mode with temperature < TEMPMAX present?
• more than one core in LF-mode present?
• cores in SLEEP-mode present?
Flowchart actions:
• Put this core to LF-mode.
• Among those, choose core with lowest temperature for transition to LF-mode.
• Among those, choose core with highest temperature for transition to LF-mode, SLEEP-mode, or OFF-mode, respectively.
• Among those, choose core with lowest temperature for transition to HF-mode.

Example run - low temperature policy
• Thermal cycling with low magnitude but high frequency

The smooth temperature policy
Flowchart decisions:
• average workload > MAX?
• average workload < MIN?
• cores in SLEEP-mode present?
• cores in HF-mode present?
• cores in LF-mode present which haven’t executed applications for more than T sec.?
• cores in LF-mode present?
Flowchart actions:
• Among those, choose core with highest temperature for transition to LF-mode.
• Among those, choose core with lowest temperature for transition to LF-mode.
• Among those, choose core with highest temperature for transition to HF-mode.
• Among those, choose most unengaged core for transition to SLEEP-mode.
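The smooth temperature policy appears to differ from a performance-oriented policy mainly in how it raises performance: cores move up one state at a time instead of jumping directly to HF-mode. The sketch below covers only this upgrade step and is one possible reading of the flowchart; the pairing of decisions with actions is an assumption, and the names `Core` and `smooth_upgrade` are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Core:
    name: str
    mode: str            # one of "HFM", "LFM", "SLEEP", "OFF"
    temperature_c: float


def smooth_upgrade(cores: List[Core]) -> None:
    """One (assumed) upgrade step when the average workload exceeds MAX."""
    sleeping = [c for c in cores if c.mode == "SLEEP"]
    if sleeping:
        # Wake the coolest sleeping core, but only up to LF-mode.
        min(sleeping, key=lambda c: c.temperature_c).mode = "LFM"
        return
    slow = [c for c in cores if c.mode == "LFM"]
    if slow:
        # Promote the hottest LF-mode core, which has the smallest
        # temperature step to make when it moves to HF-mode.
        max(slow, key=lambda c: c.temperature_c).mode = "HFM"


cores = [Core("core1", "HFM", 70.0), Core("core2", "LFM", 55.0),
         Core("core3", "SLEEP", 40.0), Core("core4", "SLEEP", 38.0)]
smooth_upgrade(cores)
print([(c.name, c.mode) for c in cores])
```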

Example run - smooth temperature policy
• Maximum temperature 86°C
• Thermal cycling with higher magnitude but lower frequency

Reliability and Temperature
• The correlation of reliability and temperature is based on the Arrhenius equation, which gives in terms of mean time to failure (MTTF): [equation missing in source]
• The models of the major electrical failure mechanisms are based on this equation, e.g. for electromigration, we have [equation missing in source]
• The effect of thermal cycling on reliability can be modeled by the Coffin-Manson relation, which gives the number Nf of cycles to failure: [equation missing in source]
• These formulas were used to determine the acceleration factor (AF) with respect to MTTF and Nf, respectively, to compare the three PM-policies to the non-power-managed case.
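In their textbook forms, the Arrhenius model gives MTTF proportional to exp(Ea / (k T)), Black's equation for electromigration adds a current-density factor J^(-n) to the same exponential, and the basic Coffin-Manson relation gives Nf proportional to (ΔT)^(-q). The sketch below derives acceleration factors from these forms. It is an illustration only: the exact variants and constants used for the slides are not known, and the activation energy, exponent, and temperatures are hypothetical.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K


def arrhenius_af(temp_ref_k: float, temp_k: float,
                 activation_energy_ev: float) -> float:
    """AF_T = MTTF(T_ref) / MTTF(T) for MTTF ~ exp(Ea / (k T)).

    Temperatures are absolute (kelvin). At equal current density the same
    factor follows from Black's equation for electromigration.
    """
    return math.exp(activation_energy_ev / BOLTZMANN_EV_PER_K
                    * (1.0 / temp_ref_k - 1.0 / temp_k))


def coffin_manson_af(delta_t_ref: float, delta_t: float,
                     exponent_q: float) -> float:
    """AF_Tc = Nf(ref) / Nf for Nf ~ (delta_T)^(-q)."""
    return (delta_t / delta_t_ref) ** exponent_q


# Hypothetical example: reference at 60 degC with 10 K cycles,
# compared case at 80 degC with 20 K cycles, Ea = 0.7 eV, q = 2.
af_t = arrhenius_af(60.0 + 273.15, 80.0 + 273.15, activation_energy_ev=0.7)
af_tc = coffin_manson_af(10.0, 20.0, exponent_q=2.0)
print(f"AF_T  = {af_t:.2f}")
print(f"AF_Tc = {af_tc:.2f}")
```

An acceleration factor above 1 means the compared case fails faster than the reference; here the reference plays the role that the non-power-managed case plays in the comparison of the three policies.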

Results
AFT: Acceleration Factor of Failure due to Temperature
AFTc: Acceleration Factor of Failure due to Thermal cycling (mean over all cores)

Conclusion
• We tried to assess the impact of different DPM strategies for multi-core systems on long-term reliability
• No detailed assumptions (structure, feature size, …) were made regarding the cores
• Failure acceleration due to temperature is more or less similar for the three PM-policies
• The smooth-temperature policy performs better by a factor of 2.7 regarding acceleration due to thermal cycling, with almost no performance loss compared to fast-upgrade, but with less power saving
• This exhibits a clear trade-off between reliability, performance, and power consumption
→ Parallelism can be used to optimize this trade-off

Thank you for your attention!
