Fast Thermal Analysis on GPU for 3D-ICs with Integrated Microchannel Cooling Zhuo Feng and Peng Li Department of Electrical and Computer Engineering, {Michigan Technological, Texas A&M} University ICCAD 2010
Outline • Introduction • Background • GPU-based full-chip thermal analysis with microchannels • Preconditioned iterative method on GPU • Experimental results and conclusions
Introduction • Effective thermal management for 3D-ICs is becoming increasingly challenging. • Increasing power density and chip design complexity. • Traditional heat sinks are expected to quickly reach their limits for meeting the cooling needs of 3D-ICs.
Introduction (cont.) • Integrated on-chip microchannel cooling (i.e. liquid cooling) is considered a very promising solution. • An experiment on a liquid-cooled 2D-IC showed: • Peak on-chip temperature reduced from 85℃ to 57℃ • Maximum temperature variation reduced from 25℃ to 6℃
Introduction (cont.) • Existing design and optimization procedures for integrated microchannels are performed without considering the full-chip thermal profile. • They may not provide the most “economic” solution. • Drawbacks: design complexity, packaging cost, etc. • Hence, a comprehensive design and optimization flow should be closely coupled with full-chip thermal analysis.
Why GPUs? • The finite difference (FD) method is more suitable for general 3D full-chip thermal simulations. • Accurate full-chip 3D thermal analysis with the FD method can be very expensive, because it requires solving a huge linear system of equations with millions of unknowns.
Why GPUs? (cont.) • GPU-based parallel computing has been employed in various electronic design automation areas. • Advantages • High computing power for large-scale homogeneous computations, e.g. matrix multiplications • Significantly higher memory bandwidth
Contributions • Proposes novel GPU-based full-chip thermal simulation methods for 3D-ICs with integrated microchannel cooling • GPU-friendly data structures and algorithm flows • Proposes a GPU-friendly two-step block relaxation scheme that integrates block-based vertical-line relaxations and liquid-flow-direction relaxations. • Achieves good speedup. • More than 35x faster than the CPU-based solver • More than 360x faster than the direct solver
Background – liquid cooling in 3D ICs • The liquid-cooled microchannels are typically integrated inside a wafer-level package, where the microchannels are connected to the liquid inlets and outlets using fluidic through-silicon vias (TSVs). • Heat can be removed more effectively, since the thermal resistance of such integrated liquid-cooled heat sinks can be much lower than that of traditional fan-cooled heat sinks.
Background – finite difference (FD) method • Replaces derivative expressions with approximately equivalent difference quotients to approximate the solutions of differential equations. • For some small h: $f'(x) \approx \frac{f(x + h) - f(x)}{h}$
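To make the difference-quotient idea concrete, here is a minimal Python sketch. The function name `forward_difference` and the test function sin(x) are illustrative choices, not taken from the slides.

```python
import math

def forward_difference(f, x, h):
    """Approximate f'(x) with the forward difference quotient (f(x+h) - f(x)) / h."""
    return (f(x + h) - f(x)) / h

# The exact derivative of sin at x = 1 is cos(1).
exact = math.cos(1.0)

# Absolute error of the approximation for three decreasing step sizes h.
errors = [abs(forward_difference(math.sin, 1.0, h) - exact)
          for h in (1e-1, 1e-2, 1e-3)]
```

The error shrinks as h decreases, which is why a finer grid gives a more accurate thermal solution at the price of more unknowns.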
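Background – full-chip thermal simulation • Discretize the PDE of the original thermal analysis problem with the FD method. • Solve GT = b, where • G is the thermal conductance matrix (built from the thermal resistances between neighboring grid cells). • T is the vector of unknown node temperatures. • b carries the information about the environment (boundary conditions).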
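As a toy illustration of how FD discretization leads to GT = b, the sketch below builds the system for a 1D rod with fixed end temperatures. It assumes G collects conductances between neighboring nodes; the uniform conductance `g`, the helper names, and the dense solver are hypothetical simplifications, not the paper's 3D model.

```python
def build_system(n, g, powers, t_left, t_right):
    """Assemble G and b for n interior nodes of a 1D rod.

    Node balance: g*(T[i] - T[i-1]) + g*(T[i] - T[i+1]) = powers[i].
    The fixed boundary temperatures are moved to the right-hand side.
    """
    G = [[0.0] * n for _ in range(n)]
    b = [float(p) for p in powers]
    for i in range(n):
        G[i][i] = 2.0 * g
        if i > 0:
            G[i][i - 1] = -g
        if i < n - 1:
            G[i][i + 1] = -g
    b[0] += g * t_left
    b[-1] += g * t_right
    return G, b

def solve_dense(G, b):
    """Gaussian elimination with partial pivoting; for small demos only."""
    n = len(b)
    A = [row[:] + [rhs] for row, rhs in zip(G, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(col + 1, n):
            factor = A[r][col] / A[col][col]
            for c in range(col, n + 1):
                A[r][c] -= factor * A[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = A[i][n] - sum(A[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / A[i][i]
    return x

# Four interior nodes, no internal heat sources, ends held at 50 and 80 degrees.
G, b = build_system(4, 1.0, [0.0] * 4, 50.0, 80.0)
T = solve_dense(G, b)
```

With no internal heat sources the temperatures fall on a straight line between the two boundary values. A full-chip model has the same structure but millions of unknowns, which is why a dense direct solve like this one is not an option there.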
Architecture of Nvidia GTX280 • A collection of 30 multiprocessors, with 8 streaming processors each. • The 30 multiprocessors share one off-chip global memory. • Access time: about 300 clock cycles • Each multiprocessor has an on-chip memory shared by its 8 streaming processors. • Access time: 2 clock cycles
GPU-based full-chip thermal analysis with microchannels • Many factors must be considered to obtain the most “economic” microchannel design. • Pumping power, placement, sizing, … • Fine-grained thermal modeling and analysis including microchannel cooling is non-trivial due to the high modeling complexity and simulation costs. • Model extraction cost and thermal simulation cost • Such large, regular computations are well matched to GPUs.
The proposed two-step block relaxation scheme • Considers heat dissipation in two directions (Z and Y).
Details • In the first step, the nodes in a block of vertical lines are relaxed together (lines L1 to L3 shown in Fig. 4). These relaxations allow fast solution updates along the vertical heat dissipation paths within the block. • In the second step, a few relaxations along the microchannel routing direction (liquid-flow direction) are performed to update the solution in the liquid-flow direction.
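The two steps can be sketched on a small 2D grid as alternating line relaxations. This is a simplified sequential stand-in, not the authors' GPU kernel: it assumes a uniform 5-point stencil (4 on the diagonal, -1 to each neighbor, zero temperature outside the grid), treats index `z` as the vertical direction and `y` as the liquid-flow direction, and omits any convective term for the coolant. All function names are hypothetical.

```python
def solve_line(rhs):
    """Thomas algorithm for the tridiagonal system with stencil (-1, 4, -1)."""
    n = len(rhs)
    c = [0.0] * n  # modified upper-diagonal coefficients
    d = [0.0] * n  # modified right-hand side
    c[0] = -1.0 / 4.0
    d[0] = rhs[0] / 4.0
    for i in range(1, n):
        denom = 4.0 + c[i - 1]
        c[i] = -1.0 / denom
        d[i] = (rhs[i] + d[i - 1]) / denom
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x

def two_step_sweep(T, b):
    """One sweep: vertical-line relaxations, then flow-direction relaxations."""
    nz, ny = len(T), len(T[0])
    # Step 1: solve each vertical line (fixed y) exactly, y-neighbors held fixed.
    for y in range(ny):
        rhs = []
        for z in range(nz):
            left = T[z][y - 1] if y > 0 else 0.0
            right = T[z][y + 1] if y < ny - 1 else 0.0
            rhs.append(b[z][y] + left + right)
        for z, value in enumerate(solve_line(rhs)):
            T[z][y] = value
    # Step 2: solve each line along the flow direction (fixed z), z-neighbors fixed.
    for z in range(nz):
        rhs = []
        for y in range(ny):
            below = T[z - 1][y] if z > 0 else 0.0
            above = T[z + 1][y] if z < nz - 1 else 0.0
            rhs.append(b[z][y] + below + above)
        T[z] = solve_line(rhs)
    return T

def max_residual(T, b):
    """Largest absolute residual of the 5-point equations."""
    nz, ny = len(T), len(T[0])
    worst = 0.0
    for z in range(nz):
        for y in range(ny):
            nb = 0.0
            if z > 0:
                nb += T[z - 1][y]
            if z < nz - 1:
                nb += T[z + 1][y]
            if y > 0:
                nb += T[z][y - 1]
            if y < ny - 1:
                nb += T[z][y + 1]
            worst = max(worst, abs(b[z][y] + nb - 4.0 * T[z][y]))
    return worst

nz, ny = 4, 6
b = [[1.0] * ny for _ in range(nz)]
T = [[0.0] * ny for _ in range(nz)]
for _ in range(60):
    two_step_sweep(T, b)
```

Each line solve updates a whole column or row at once, so information travels across the full line in a single step instead of one grid cell per iteration.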
But why? • The efficiency of typical iterative methods usually depends on • Efficiency of the sparse matrix-vector operations • Effectiveness of the relaxation (iteration) scheme • Existing iterative algorithms focus only on vertical heat dissipation. • Horizontal (in-plane) dissipation in traditional 2D ICs is negligible because the lateral thermal conductance is relatively small. • This is not the case in 3D ICs.
Preconditioned iterative method on GPU • Two critical issues affect the run time: • Matrix representation format • Convergence rate of the iterative method • Solution: use an ELL-like format and a preconditioning technique.
Matrix representation format • GPU-based computations should ensure that most global memory accesses are coalesced, so the data structure and its memory access pattern must be designed carefully. • Use three 1D vectors to fully represent the sparse matrix in a layout that fits memory coalescing: • Diagonal entries, off-diagonal entries, and the column indices of the off-diagonal entries • 2x to 3x speedup compared with the CSR format.
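A minimal sketch of such a three-vector layout and its matrix-vector product follows. The slot-major ordering (entry `slot * n + row`), the zero padding, and the function names are assumptions chosen to mimic an ELL-style layout; the slides do not give the exact layout used in the paper.

```python
def to_ell_like(dense, max_off):
    """Pack a square matrix into three 1D vectors.

    diag[i]           : diagonal entry of row i
    vals[slot*n + i]  : slot-th off-diagonal value of row i (0.0 if padded)
    cols[slot*n + i]  : column index of that value (0 if padded)
    """
    n = len(dense)
    diag = [dense[i][i] for i in range(n)]
    vals = [0.0] * (n * max_off)
    cols = [0] * (n * max_off)
    for i in range(n):
        slot = 0
        for j in range(n):
            if j != i and dense[i][j] != 0.0:
                if slot >= max_off:
                    raise ValueError("row %d has more than max_off off-diagonals" % i)
                vals[slot * n + i] = dense[i][j]
                cols[slot * n + i] = j
                slot += 1
    return diag, vals, cols

def spmv(diag, vals, cols, x):
    """Compute y = A x from the three vectors."""
    n = len(diag)
    max_off = len(vals) // n
    y = [diag[i] * x[i] for i in range(n)]
    for slot in range(max_off):
        # On a GPU, one thread per row i would read vals[slot*n + i]:
        # neighboring threads touch neighboring addresses (coalesced access).
        for i in range(n):
            k = slot * n + i
            y[i] += vals[k] * x[cols[k]]
    return y

A = [[4.0, -1.0, 0.0],
     [-1.0, 4.0, -1.0],
     [0.0, -1.0, 4.0]]
diag, vals, cols = to_ell_like(A, max_off=2)
y = spmv(diag, vals, cols, [1.0, 2.0, 3.0])
```

Padded slots store a zero value, so they add nothing to the product. FD matrices have nearly the same number of off-diagonals in every row, which keeps the padding overhead small.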
Conjugate gradient (CG) method • The CG method is an algorithm for the numerical solution of systems of linear equations whose matrix is symmetric and positive-definite. • CG is an iterative method, so it can be applied to sparse systems that are too large to be handled by direct methods such as the Cholesky decomposition. Such systems often arise when numerically solving partial differential equations. • Solving $Ax = b$ is equivalent to minimizing $f(x) = \tfrac{1}{2} x^{\top} A x - x^{\top} b$ (for the thermal problem, $A = G$ and $x = T$). • Assuming exact arithmetic, CG converges in at most n steps, where n is the size of the matrix of the system (n = 2 in the slide's example).
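The at-most-n-steps property can be checked on a 2x2 system with a plain-Python CG sketch. The matrix and right-hand side below are illustrative values, not data from the paper.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(A, x):
    return [dot(row, x) for row in A]

def conjugate_gradient(A, b, tol=1e-10, max_iter=None):
    """Solve A x = b for a symmetric positive-definite A, starting from x = 0."""
    n = len(b)
    x = [0.0] * n
    r = list(b)            # residual b - A x for the initial guess x = 0
    p = list(r)            # first search direction
    rs_old = dot(r, r)
    iterations = 0
    for _ in range(max_iter or n):
        if rs_old ** 0.5 < tol:
            break
        Ap = matvec(A, p)
        alpha = rs_old / dot(p, Ap)                       # step length along p
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        iterations += 1
        if rs_new ** 0.5 < tol:
            break
        beta = rs_new / rs_old
        p = [ri + beta * pi for ri, pi in zip(r, p)]      # next conjugate direction
        rs_old = rs_new
    return x, iterations

A = [[4.0, 1.0],
     [1.0, 3.0]]
b = [1.0, 2.0]
x, iterations = conjugate_gradient(A, b)
```

For this system the exact solution is (1/11, 7/11), and CG reaches it within two iterations up to rounding error.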
Preconditioning • The conjugate gradient (CG) method takes too many iterations when the matrix is ill-conditioned, which is usually the case here. • Condition number: $\kappa(A) = \lVert A \rVert \, \lVert A^{-1} \rVert$ • However, the total runtime of preconditioned CG can exceed that of plain CG if the preconditioner is ineffective or expensive to apply, even though the number of iterations is lower. • Three methods are compared: • CG, diagonal-preconditioned CG (DPCG), and multigrid-preconditioned CG (MGPCG)
Preconditioning (cont.) • Preconditioning applies a transformation, called the preconditioner, that turns a given problem into a form more suitable for numerical solution. • Preconditioned system • Preconditioned iterative method • Practical preconditioner
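As one concrete instance, the sketch below applies a diagonal preconditioner M = diag(A) inside CG and runs it next to plain CG (M = I). The badly scaled test matrix and all names are hypothetical; this is not the paper's multigrid preconditioner.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(A, x):
    return [dot(row, x) for row in A]

def pcg(A, b, inv_diag, tol=1e-8, max_iter=1000):
    """CG with a diagonal preconditioner given by its inverse entries.

    Stops when the relative residual norm ||r|| / ||b|| drops below tol.
    Passing inv_diag = [1, 1, ...] gives plain CG.
    """
    x = [0.0] * len(b)
    r = list(b)
    z = [d * ri for d, ri in zip(inv_diag, r)]    # z = M^{-1} r
    p = list(z)
    rz_old = dot(r, z)
    b_norm = dot(b, b) ** 0.5
    for k in range(1, max_iter + 1):
        Ap = matvec(A, p)
        alpha = rz_old / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        if dot(r, r) ** 0.5 <= tol * b_norm:
            return x, k
        z = [d * ri for d, ri in zip(inv_diag, r)]
        rz_new = dot(r, z)
        p = [zi + (rz_new / rz_old) * pi for zi, pi in zip(z, p)]
        rz_old = rz_new
    return x, max_iter

# Hypothetical badly scaled SPD matrix: A = S * tridiag(-1, 4, -1) * S,
# where S is a diagonal scaling with entries 1, 2, ..., n.
n = 50
s = [1.0 + i for i in range(n)]
A = [[0.0] * n for _ in range(n)]
for i in range(n):
    A[i][i] = 4.0 * s[i] * s[i]
    if i + 1 < n:
        A[i][i + 1] = A[i + 1][i] = -s[i] * s[i + 1]
b = [1.0] * n

x_cg, iters_cg = pcg(A, b, [1.0] * n)                           # plain CG
x_dp, iters_dp = pcg(A, b, [1.0 / A[i][i] for i in range(n)])   # diagonal preconditioner
```

Comparing `iters_cg` with `iters_dp` shows the effect of the preconditioner. On a matrix whose ill-conditioning comes mostly from scaling, like this one, the diagonal preconditioner is expected to need fewer iterations, and it costs only one multiplication per unknown to apply.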
Multi-grid preconditioner • The details are not entirely clear, but the idea is to coarsen the grid to reduce complexity.
Experimental results • Environment • Intel Core 2 Quad 2.66GHz with one NVIDIA GeForce GTX 285 • DRAM: 6 GB for the CPU, 2 GB for the GPU • C++ and CUDA on Linux • Inlet water temperature: 50℃ • A set of 3D designs, each stacking six 2D dies. • Convergence criterion of the iterative solver: residual norm < 10^-6. • The error is negligible.
Experimental results (cont.) • The traditional smoother uses vertical-line relaxation only. • Significant speedup of at least 35x.
Conclusions • Proposes GPU-based thermal simulation methods for 3D ICs with integrated liquid-cooled microchannels. • GPU-friendly two-step block-based relaxation scheme. • Highly accurate results with significant speed-up.