Optimizing Performance of the Lattice Boltzmann Method for Complex Structures

Optimizing Performance of the Lattice Boltzmann Method for Complex Structures Friedrich-Alexander University Erlangen/Nuremberg Department of Computer Science 10 (System Simulation) Regional Computing Center of Erlangen (RRZE)

Outline • Introduction • Lattice Boltzmann Method • Implementation Aspects • Application • Implementation • Optimization • Results • Conclusion Stefan Donath Friedrich-Alexander University Erlangen-Nuremberg

Lattice Boltzmann Method • Boltzmann Equation • Discretization of particle velocity space (finite set of discrete velocities) Stefan Donath Friedrich-Alexander University Erlangen-Nuremberg

Lattice Boltzmann Method • Different discretization schemes • Numerical accuracy and stability • Computational speed and simplicity D3Q15 D3Q19 D3Q27 Stefan Donath Friedrich-Alexander University Erlangen-Nuremberg

Lattice Boltzmann Method • Discretization in space x and time t: collision step: streaming step: Stefan Donath Friedrich-Alexander University Erlangen-Nuremberg

Lattice Boltzmann Method (Implementation Aspects) • Discretization in space x and time t: collision step: streaming step: • Stream-Collide (Pull-Method) • Get the distributions from the neighboring cells in the source arrayand store the relaxated values to onecell in the destination array • Collide-Stream (Push-Method) • Take the distributions from one cellin the source array and store therelaxated values to the neighboringcells in the destination array W source destination Stefan Donath Friedrich-Alexander University Erlangen-Nuremberg

Lattice Boltzmann Method (Implementation Aspects) • Walls and Obstacles: Bounce Back rule Stefan Donath Friedrich-Alexander University Erlangen-Nuremberg

Implementation Aspects • Data Dependencies • Two Grids • Compressed Grid  Stefan Donath Friedrich-Alexander University Erlangen-Nuremberg

Implementation Aspects double precision f(0:xMax+1,0:yMax+1,0:zMax+1,0:18,0:1) do z=1,zMax do y=1,yMax do x=1,xMax if( fluidcell(x,y,z) ) then LOAD f(x,y,z, 0:18,t) Relaxation (complex computations) SAVE f(x ,y ,z , 0,t+1) SAVE f(x+1,y+1,z , 1,t+1) SAVE f(x ,y+1,z , 2,t+1) SAVE f(x-1,y+1,z , 3,t+1) … SAVE f(x ,y-1,z-1,18,t+1) endif enddo enddo enddo Collide Stream Stefan Donath Friedrich-Alexander University Erlangen-Nuremberg

Application Stefan Donath Friedrich-Alexander University Erlangen-Nuremberg

P C Porous Media Combustion Application: Porous Media Combustion • New technology for heating installations: Porous Media Combustion • Fuel-air-mixture does no longer react in a free flame • Combustion process takes place inside the pores of a porous medium that is placed in the reaction area 60mm 20mm Stefan Donath Friedrich-Alexander University Erlangen-Nuremberg

P C Porous Media Combustion Application: Porous Media Combustion Figures by courtesy of LSTM Uni-Erlangen, Thomas Zeiser Stefan Donath Friedrich-Alexander University Erlangen-Nuremberg

P C Porous Media Combustion Application: Porous Media Combustion • Various applications: modern steam engines, vehicle heaters Figures by courtesy of LSTM Uni-Erlangen, Thomas Zeiser Stefan Donath Friedrich-Alexander University Erlangen-Nuremberg

Application: Introducing Complex Geometries used • Porous Medium from PMC: Silicon-Carbide (SiC) • Many small obstacles • Obstacle/fluid-ratio: ~2% • High number of fluid-solid faces Stefan Donath Friedrich-Alexander University Erlangen-Nuremberg

Application: Introducing Complex Geometries used • Second Test-Geometry: • “MC“ • Huge obstacles, only fewfluid tubes • Obstacle/fluid-ratio: ~50% • Low number of fluid-solid faces Stefan Donath Friedrich-Alexander University Erlangen-Nuremberg

Implementation Stefan Donath Friedrich-Alexander University Erlangen-Nuremberg

Implementation • Collision and streaming step in same loop • Push-Method • Data representation in 1D-Array • Stores only Fluid Cells ( saves memory) • Indirect addressing of target cells (by extra connectivity array) • Boundary conditions (Bounce Back) handled implicitly Stefan Donath Friedrich-Alexander University Erlangen-Nuremberg

Implementation • Indirect addressing and implicit Bounce Back obstacle wall i-1 i i+1 i-1 i i+1 connectivity array Stefan Donath Friedrich-Alexander University Erlangen-Nuremberg

Implementation • Preprocessor • Sets up connectivity array • Specifies all domain parameters: • Geometry and obstacles • Traversing scheme • Solver • Reads in preprocessed information • Performs lattice Boltzmann method (in single loop) Stefan Donath Friedrich-Alexander University Erlangen-Nuremberg

Implementation Rating: 1D-Array compared to standard implementation using multidimensional array and three loops • Advantages • Saves memory The more obstacles in the domain the higher is the compression • Implicit Bounce Back No extra routine or if-statement needed • Drawbacks • Indirect addressing Prevents compiler from vectorization and other optimization techniques  Consequently, worse performance Stefan Donath Friedrich-Alexander University Erlangen-Nuremberg

Optimization - Outline • Memory Traversing Schemes • Space-Filling Curves • Blocking • Memory Layouts • Further Optimization Techniques Stefan Donath Friedrich-Alexander University Erlangen-Nuremberg

Memory Traversing Schemes: Space-Filling Curves • What is a space-filling curve? Loosely spoken: A one dimensional curve that fills a higher dimensional space • Which curves were used? • Hilbert • Peano • How are they constructed? Again, loosely: By a mapping from a one-dimensional interval to the higher dimensional space Then, by recursion new mapping of each part of the interval Stefan Donath Friedrich-Alexander University Erlangen-Nuremberg

Memory Traversing Schemes: Space-Filling Curves • And how works construction really? 1 0 Stefan Donath Friedrich-Alexander University Erlangen-Nuremberg

“Hilbert bne“ “Hilbert nwf“ “Hilbert fws“ “Hilbert enb“ “Peano fsw“ Memory Traversing Schemes: Space-Filling Curves • How does that look in 3D? • OK, but how to construct them? Stefan Donath Friedrich-Alexander University Erlangen-Nuremberg

Memory Traversing Schemes: Space-Filling Curves • How to construct them? • Table based approach • Hilbert: 48 Productions with 8 entries and 7 connectors • Peano: 8 Productions with 27 entries and 26 connectors Stefan Donath Friedrich-Alexander University Erlangen-Nuremberg

Memory Traversing Schemes: Space-Filling Curves • Summary for Space-Filling Curves • Recursive production by segmentation Limitation in system sizes: • Hilbert: 23n • Peano: 33n • Increase spatial locality • Enable mesh-refinement Stefan Donath Friedrich-Alexander University Erlangen-Nuremberg

Memory Traversing Schemes: Blocking • Implicit blocking technique • Arrangement of data in a blocking manner • Increases spatial locality Stefan Donath Friedrich-Alexander University Erlangen-Nuremberg

Memory Traversing Schemes • Notes on Memory Traversing Schemes Pure preprocessing technique: • Only arrangement of data in memory changed • No change for solver • No overhead for solver (e.g. loop overhead for blocking) Stefan Donath Friedrich-Alexander University Erlangen-Nuremberg

Memory Layouts: Collision Optimized Layout • Standard Array-Layout: F(i,x,y,z,t) “Array-of-Structures“ • Collision optimized • Optimal read access:2 cache lines per LUP • Bad write access:19 stores in 19 cache linesBut: Depending on systemsize some of them arealready/still in cache Stefan Donath Friedrich-Alexander University Erlangen-Nuremberg

Memory Layouts: Collision Optimized Layout • 18 write accesseson one cell from3 different z-layers • 8 write accesseson one cell from3 different rows Stefan Donath Friedrich-Alexander University Erlangen-Nuremberg

Memory Layouts: Collision Optimized Layout • Performance of Collision-Optimized Layout (P4, 512kB L2) Stefan Donath Friedrich-Alexander University Erlangen-Nuremberg

Memory Layouts: Propagation Optimized Layout • Optimized Array-Layout: F(x,y,z,i,t) “Structure-of-Arrays“ • stride-1-access on x (inner loop) • 19 cache lines per 16 LUPsin read and write process • 1 cache miss each 16th memory access Stefan Donath Friedrich-Alexander University Erlangen-Nuremberg

Memory Layouts: Propagation Optimized Layout • Performance on Pentium 4, 512kB L2 Stefan Donath Friedrich-Alexander University Erlangen-Nuremberg

Further Optimization Techniques • Additional Bottlenecks • Large loop body (causes register spills on IA32) • Concurrent writing to 19 different cache lines interferes with number of write combine buffers on IA32 (6 for Intel Xeon/Nocona) • Indirect addressing prevents IA32 hardware prefetcher from preloading values for target cells (due to bounce back at obstacles) • Solutions (implemented in Solver): • Split up loop in 5 loops of length Nx • Manual Block Preload Technique (Drawback: Both techniques need a loop blocking scheme) These solutions only needed for IA32 Stefan Donath Friedrich-Alexander University Erlangen-Nuremberg

Results - Outline • Architecture Descriptions • Comparison 1D-Solver to Standard Solver • Memory Layouts • For Standard Solver • For 1D-Solver • Influence of Geometry • MC • SiC • Memory Traversing Schemes • Space-Filling Curves • Blocking Stefan Donath Friedrich-Alexander University Erlangen-Nuremberg

Architecture Description: Nocona/Irwindale • Test System: Test machine at RRZE • CPUs: Nocona (Irwindale), 3.6GHz, 2MB L2-Cache • Memory: DDR400, 6.4GB/s • Architectural specialties: • EM64T extension • Hyperthreading • One memory bus for both processors Stefan Donath Friedrich-Alexander University Erlangen-Nuremberg

Architecture Description: Itanium2 / Altix • Test System: RRZE SGI Altix • CPUs: Itanium 2, 1.3 GHz, 3MB L3-Cache • Memory: 112GB “distributed shared memory“ • Architectural specialties: • Itanium 2: • “EPIC“ (Explicitly Parallel Instruction Computing) • No out-of-order • Parallelization of commands in the grip of compiler ( bundles) • L1-Cache only for Integer • Altix: • ccNUMA with NUMALink 3 • Memory connected hierarchically by SHUBs Stefan Donath Friedrich-Alexander University Erlangen-Nuremberg

Architecture Description: AMD Opteron • Test System: LSS HPC-Cluster • CPUs: AMD Opteron, 2.2GHz, 1MB L2-CacheIA32-compatible • Memory: DDR333, 5.2GB/s • Architectural specialties: • Compute nodes with four CPUs • 4GB RAM per CPU, each CPU can access 16GB per ccNUMA • Interconnect: • CPUs on one node: HyperTransport (6.4GB/s) • Nodes: Infiniband (10GB/s) Stefan Donath Friedrich-Alexander University Erlangen-Nuremberg

Comparison 1D-Array Solver to Standard Solver Stefan Donath Friedrich-Alexander University Erlangen-Nuremberg

Memory Layouts for Standard Solver Stefan Donath Friedrich-Alexander University Erlangen-Nuremberg

Memory Layouts on Itanium 2 Stefan Donath Friedrich-Alexander University Erlangen-Nuremberg

Memory Layouts on AMD Opteron Stefan Donath Friedrich-Alexander University Erlangen-Nuremberg

Memory Layouts on Nocona/Irwindale Stefan Donath Friedrich-Alexander University Erlangen-Nuremberg

Influence of Geometry: SiC-foam (~2% obstacles) Stefan Donath Friedrich-Alexander University Erlangen-Nuremberg

Influence of Geometry: MC (~50% obstacles) Stefan Donath Friedrich-Alexander University Erlangen-Nuremberg

Memory Traversing Schemes Stefan Donath Friedrich-Alexander University Erlangen-Nuremberg

Conclusion • 1D-Array Data representation makes performance independent of obstacle to fluid ratio • Memory traversing by Space-Filling Curves results in similar performance as spatial blocking • Implementation of SFCs is not worth the effort (if they are used as memory traversing alternative only) • Together with indirect addressing Collision Optimized Layout with blocking is best technique if cache is larger than 1 MB • Indeed, there are cases where Propagation Optimized Layout is not best Stefan Donath Friedrich-Alexander University Erlangen-Nuremberg

Outlook • Future work could concern: • Space-Filling Curves: • Kind of staggered SFCs, for every direction own curve • Avoid waste of underused cache lines where lattice sites are neighboring cells which are visited much later • Galerkin-discretization or point wise evaluation of LBM to enable stack-implementation in conjunction with SFCs • BUT: For real-world problems construction on non-cubic grids is necessary at first • Search for vectorization enhancing techniques to over-come problems with indirect addressing on Itanium 2 • Search for reasons why Collision Optimized Layout is better than Propagation Optimized Layout Stefan Donath Friedrich-Alexander University Erlangen-Nuremberg

Optimizing Performance of the Lattice Boltzmann Method for Complex Structures

Optimizing Performance of the Lattice Boltzmann Method for Complex Structures

Presentation Transcript

Lattice Method

Simulation of MHD Flows using the Lattice Boltzmann Method

Lattice Boltzmann Equation Method in Electrohydrodynamic Problems

LATTICE BOLTZMANN SIMULATIONS OF COMPLEX FLUIDS

3D Lattice Boltzmann Magneto-hydrodynamics (LBMHD3D)

Optimizing Performance

LATTICE BOLTZMANN SIMULATIONS OF COMPLEX FLUIDS

Lattice Boltzmann Equation Method in Electrohydrodynamic Problems

Ionic lattice structures

Standard Lattice Boltzmann Method (LBM) Current LBM Methods for Complex Problems

Artificial Compressibility Method and Lattice Boltzmann Method Similarities and Differences

Simulation of Micro-channel Flows by Lattice Boltzmann Method

Lattice Method

Mesoscale modelling with the Lattice Boltzmann model

Lattice Boltzmann Method

Lattice Method

APPLICATION OF LATTICE BOLTZMANN METHOD

Lattice-Boltzmann method for non-Newtonian and non-equilibrium flows

The Lattice-Boltzmann Method for Gaseous Phenomena

Lattice Method

Introduction: Lattice Boltzmann Method for Non-fluid Applications

3D Lattice Boltzmann Magneto-hydrodynamics (LBMHD3D)

Sea Ice

Sea Ice