
Parallel Adaptive Mesh Refinement Combined With Multigrid for a Poisson Equation



  1. Parallel Adaptive Mesh Refinement Combined With Multigrid for a Poisson Equation CRTI-02-0093RD Project Review Meeting Canadian Meteorological Centre August 22-23, 2006

  2. Outline • Introduction • Numerical methods • Parallel load-balancing with space-filling curves (SFC) • Data distribution • Adaptive mesh refinement and derefinement • Construction of the ghost boundary cells for each processor • Discretization of the Poisson equation • Parallel multigrid preconditioner with conjugate gradient method • Numerical results • Conclusions

  3. Introduction • Each node represents a block of cells. • Advantage: the cells in each block can be organized as two- or three-dimensional arrays, so an existing structured-grid solver can be used for AMR without many modifications. • Disadvantage: it is inflexible, and a substantial number of cells can be wasted in smooth flow regions. Structured adaptive mesh refinement (AMR): block-structured AMR

  4. Introduction • Each node represents a cell. • The mesh is refined only locally, in contrast to block-structured AMR. • It is more flexible and computationally more efficient than block-structured AMR. • Cell-based AMR is therefore chosen in the present work. Cell-based AMR

  5. Introduction • The cells can be organized as a quad-tree in 2D or an oct-tree in 3D. • An oct-tree structure needs 17 words of memory per cell if the connectivity information is stored explicitly. • If connectivity is not stored explicitly, the tree may have to be traversed all the way up to its root to find a required neighboring cell. • Such searches are difficult to parallelize because a search may extend from one processor to another. An ordinary tree data structure
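
The slide quotes only the 17-word total; the breakdown below is an assumption. A minimal Python sketch of one cell of an ordinary oct-tree with fully explicit connectivity:

    from dataclasses import dataclass, field
    from typing import List, Optional

    @dataclass
    class TreeCell:
        # One plausible accounting of the 17 words per cell: 1 parent
        # + 8 children + 6 face neighbors + level + flags = 17 words.
        parent: Optional["TreeCell"] = None                      # 1 word
        children: List[Optional["TreeCell"]] = field(
            default_factory=lambda: [None] * 8)                  # 8 words
        neighbors: List[Optional["TreeCell"]] = field(
            default_factory=lambda: [None] * 6)                  # 6 words
        level: int = 0                                           # 1 word
        flags: int = 0                                           # 1 word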

  6. Introduction Fully Threaded Tree (FTT) structure • All cells are grouped together as Octs. • The memory overhead is significantly reduced: maintaining an octal FTT requires about three words of memory per cell instead of the 17 words of an ordinary oct-tree. An oct-tree structure in an FTT

  7. Introduction Fully Threaded Tree (FTT) structure • The west and south neighbors of cell 6 can be found directly through its explicitly stored parent Oct. • The east and north neighbors of cell 6 can be found through the neighboring cells of its parent Oct. • No more than one level of the tree needs to be traversed to access the neighbors of a cell, as sketched below. An example of accessing the neighbors of a cell without searching using the FTT structure.
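
A minimal Python sketch of the FTT layout on slide 6 and the one-level neighbor lookup on slide 7, written for a 2D quad-tree to match the figures; the child-numbering and direction conventions, and all names, are illustrative assumptions rather than the talk's actual code:

    from dataclasses import dataclass, field
    from typing import List, Optional

    W, E, S, N = 0, 1, 2, 3   # face directions: d >> 1 = axis, d & 1 = positive side

    @dataclass
    class Oct:
        # Connectivity is stored once per Oct rather than once per cell, which
        # is how the FTT cuts the overhead to about three words per cell.
        level: int
        parent: int                                                     # global ID of the subdivided cell
        neighbors: List[int] = field(default_factory=lambda: [-1] * 4)  # face neighbors of the parent cell
        cells: List[int] = field(default_factory=lambda: [-1] * 4)      # global IDs of the 4 sibling cells

    @dataclass
    class Cell:
        gid: int                           # unique global ID (see slide 11)
        x: int = 0                         # integer cell coordinates
        y: int = 0
        parent_oct: Optional[Oct] = None   # Oct that contains this cell
        child_oct: Optional[Oct] = None    # None for a leaf

    def face_neighbor(cells, oct_, k, d):
        # Neighbor of the k-th cell of oct_ in direction d, visiting at most
        # one tree level.  Child k encodes its position inside the Oct:
        # bit 0 = x side, bit 1 = y side; `cells` maps global ID -> Cell.
        axis, positive = d >> 1, d & 1
        bit = 1 << axis
        if ((k >> axis) & 1) != positive:
            return cells[oct_.cells[k ^ bit]]          # sibling in the same Oct
        coarse = cells[oct_.neighbors[d]]              # neighbor of the parent cell
        if coarse.child_oct is None:
            return coarse                              # one-level-coarser leaf
        return cells[coarse.child_oct.cells[k ^ bit]]  # same-level neighbor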

  8. Introduction • Objective: propose a new parallel approach for an AMR code based on the FTT data structure.

  9. Numerical methods Parallel load-balancing with space-filling curves (SFC) • An SFC is chosen as the grid partitioner because of its simple mapping, compactness, and locality. • Points in a higher-dimensional space are mapped to corresponding points on a line. • Only the coordinates of a point in the higher-dimensional domain are required to compute its location on the 1D line, as in the sketch below. • In the Hilbert ordering, all adjacent neighbors on the 1D line are face-neighboring cells in the higher-dimensional domain (locality). Space-filling curves in two dimensions: (a) Hilbert or U ordering, (b) Morton or N ordering.
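
A runnable sketch of the coordinate-to-curve mapping described above, using the classic bit-manipulation form of the 2D Hilbert index; the production code's exact routine is not shown in the slides:

    def hilbert_index(order, x, y):
        # Map integer cell coordinates (x, y) on a 2**order x 2**order grid to
        # a position along the 2D Hilbert curve.  Only the coordinates are
        # needed -- no neighbor data and no communication.
        n = 1 << order
        d = 0
        s = n // 2
        while s > 0:
            rx = 1 if x & s else 0
            ry = 1 if y & s else 0
            d += s * s * ((3 * rx) ^ ry)
            if ry == 0:                      # rotate/reflect the quadrant so
                if rx == 1:                  # the curve stays face-connected
                    x, y = n - 1 - x, n - 1 - y
                x, y = y, x
            s //= 2
        return d

    # The four cells of a 2x2 grid are visited in U order:
    # [hilbert_index(1, x, y) for x, y in [(0, 0), (0, 1), (1, 1), (1, 0)]] == [0, 1, 2, 3]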

  10. Numerical methods Parallel load-balancing with space-filling curves (SFC) • The different colors correspond to partitions on different processors. • Only leaf cells are shown in the figure; the partitioning step is sketched below. The two-dimensional adaptive grid partitioned on four processors with the Hilbert SFC.
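
A sketch of the partitioning step shown in the figure, reusing hilbert_index from above: sort the leaf cells by Hilbert key and cut the curve into contiguous pieces of nearly equal size. The real code may weight cells by work; equal cell counts are an assumption:

    def partition(leaves, nprocs, order):
        # leaves: list of (gid, x, y) tuples for the leaf cells.  Returns
        # {gid: owner rank}.  Consecutive cells on the curve land on the same
        # rank, so each partition stays compact in the 2D domain.
        keyed = sorted(leaves, key=lambda c: hilbert_index(order, c[1], c[2]))
        return {gid: i * nprocs // len(keyed) for i, (gid, _, _) in enumerate(keyed)}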

  11. Numerical methods Data distribution • A unique global ID is used to identify each cell instead of a local ID on each processor. • The processor ID is not stored for each cell; it can be computed from the cell's spatial coordinates using the SFC, as sketched below. • A hash-table technique is applied to store the cell and Oct structures on each processor. • If a cell is marked for migration to another processor by the Hilbert SFC, both the cell data and the corresponding Oct structures are migrated. The global ID is used to identify each cell.
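
A sketch of the storage and owner-lookup scheme these bullets describe, reusing hilbert_index from slide 9. The cut-point convention is an assumption; the slides say only that the processor ID is computed from the coordinates:

    import bisect

    local_cells = {}   # hash table: global cell ID -> Cell (no local renumbering)
    local_octs = {}    # hash table: global ID of the parent cell -> Oct

    def owner_of(x, y, cuts, order):
        # Processor that owns the cell at (x, y): compute its Hilbert key and
        # binary-search the partition cut points.  cuts[r] is assumed to hold
        # the first Hilbert key owned by rank r + 1.
        return bisect.bisect_right(cuts, hilbert_index(order, x, y))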

  12. Numerical methods Adaptive mesh refinement and derefinement • Constraint: no neighboring cells with a level difference greater than one are allowed. • If cell A is marked for refinement, check the neighboring cells of the parent cell of cell A (i.e., cells B and D); if these neighbors are leaves, they are marked for refinement as well (see the sketch below). • If cells B and C belong to two different processors, the global ID of the neighbor of the parent cell of cell B is sent to the processor where cell C resides. An example showing how cells are flagged for refinement across 2 processors
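
A serial sketch of the flag-propagation rule, built on the Oct/Cell structures from slide 7; in the parallel code a flag that crosses a partition boundary travels as a global ID, as the last bullet says:

    from collections import deque

    def enforce_refinement_constraint(marked, cells):
        # marked: set of global IDs of leaf cells flagged for refinement.
        # Propagate flags so no face neighbors differ by more than one level.
        queue = deque(marked)
        while queue:
            cell = cells[queue.popleft()]
            oct_ = cell.parent_oct
            if oct_ is None:
                continue                        # root-level cell
            for d in (W, E, S, N):
                ngid = oct_.neighbors[d]        # neighbors of the parent cell
                if ngid < 0:
                    continue                    # physical boundary
                if cells[ngid].child_oct is None and ngid not in marked:
                    marked.add(ngid)            # a coarser leaf must refine too
                    queue.append(ngid)
        return marked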

  13. Numerical methods Adaptive mesh refinement and derefinement Before and after enforcing the refinement constraints on 4 processors

  14. Numerical methods Adaptive mesh refinement and derefinement • If cell A is marked to be coarsened: • all the child cells of cell A must be leaves; • if any neighboring cell is not a leaf, check its two child cells nearest to cell A; • if those two child cells are not leaves and are not themselves marked for coarsening, cell A cannot be coarsened (see the sketch below). An example showing how cell A is coarsened without violating the constraint.
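
A sketch of these checks using the Oct/Cell structures from slide 7; the "nearest two children" are taken to be the two children on the face toward cell A, which is an assumption consistent with the figure:

    def can_coarsen(gid, cells, marked_coarse):
        # Checks before coarsening cell A = cells[gid]; marked_coarse is the
        # set of global IDs currently flagged for coarsening.
        oct_ = cells[gid].child_oct
        if oct_ is None:
            return False                        # cell A is already a leaf
        # 1. all child cells of cell A must be leaves
        if any(cells[c].child_oct is not None for c in oct_.cells):
            return False
        # 2. for each non-leaf neighbor of cell A, its two child cells nearest
        #    to A must be leaves or be marked for coarsening themselves
        for d in (W, E, S, N):
            ngid = oct_.neighbors[d]            # the Oct stores A's neighbors
            if ngid < 0 or cells[ngid].child_oct is None:
                continue
            axis, positive = d >> 1, d & 1
            nearest = [k for k in range(4)      # children on the face toward A
                       if (k >> axis) & 1 == 1 - positive]
            for k in nearest:
                cgid = cells[ngid].child_oct.cells[k]
                if cells[cgid].child_oct is not None and cgid not in marked_coarse:
                    return False
        return True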

  15. Numerical methods Construction of the ghost boundary cells for each processor • The corresponding Oct data structure has to be generated so that the boundary cells can find their neighboring cells. • Seven cells in each neighbor direction use Oct A to find their neighboring cells. • The Hilbert coordinates of all neighboring cells are computed to obtain their processor IDs. • The data in Oct A are then sent to the processors where all the related neighboring cells reside, as sketched below. The neighboring cells related to Oct A in the FTT data structure.
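
A sketch of deciding which ranks must receive Oct A's data, combining the structures from slide 7 with owner_of from slide 11; the exact set of related cells (seven per direction in the figure) depends on details not in the transcript, so only the parent-level face neighbors are shown:

    def ghost_destinations(oct_, cells, cuts, order, my_rank):
        # Remote ranks whose boundary cells will need a copy of this Oct: the
        # owner of each related neighboring cell is computed directly from its
        # Hilbert key -- no search and no extra communication round.
        dests = set()
        for d in (W, E, S, N):
            ngid = oct_.neighbors[d]
            if ngid >= 0:
                nbr = cells[ngid]
                dests.add(owner_of(nbr.x, nbr.y, cuts, order))
        dests.discard(my_rank)
        return dests                # send Oct A's data to each rank in dests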

  16. Numerical methods Construction of the ghost boundary cells for each processor • The ghost boundary cells for each processor can then be determined from the Oct data structures. The local leaf cells together with their corresponding ghost leaf boundary cells on two processors

  17. Numerical methods Discretization of the Poisson equation • The Poisson equation is discretized to second-order accuracy. • Cell-centered gradients are used to approximate the value at the auxiliary node P'. • The least-squares approach is used to evaluate the cell-centered gradients. • A reconstruction of the flux formula is sketched below. Approximation of the gradient flux Fe based on values at the node E and the auxiliary node P'
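
The equations on this slide were images; the following LaTeX is a plausible reconstruction of the scheme the bullets and the caption describe. The notation P, E, P' follows the caption, and the placement of P' on the line through E is an assumption:

    % Poisson equation, integrated over a control volume V_P:
    \nabla^{2}\phi = f, \qquad
    \oint_{\partial V_P} \nabla\phi\cdot\mathbf{n}\,dS
      \;\approx\; \sum_{f} F_{f} A_{f} \;=\; f_{P} V_{P}.

    % Gradient flux across face e between cells P and E, using the auxiliary
    % node P' chosen so that the line P'E is orthogonal to the face:
    F_{e} \;\approx\; \frac{\phi_{E}-\phi_{P'}}{\lVert \mathbf{x}_{E}-\mathbf{x}_{P'} \rVert},
    \qquad
    \phi_{P'} = \phi_{P} + \nabla\phi_{P}\cdot(\mathbf{x}_{P'}-\mathbf{x}_{P}),

    % with the cell-centered gradient \nabla\phi_P obtained from a
    % least-squares fit over the neighbors of P (second-order accurate).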

  18. Numerical methods Parallel multigrid preconditioner with the conjugate gradient method • Additive multigrid method: the smoothing can be performed simultaneously (in parallel) at all grid levels. • This gives better parallel performance than the classical multigrid method. • It does not converge when used as a stand-alone solver, so it is applied as a preconditioner combined with the conjugate gradient method, as sketched below. A sketch of the V-cycle additive multigrid method.
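
A self-contained numpy sketch of this idea on a 1D Dirichlet model problem rather than the talk's 2D adaptive grids: a BPX-style additive preconditioner that smooths the residual on every level at once, wrapped in a conjugate gradient iteration. All routine names and the level hierarchy are illustrative:

    import numpy as np

    def poisson_1d(n):
        # 1D Poisson matrix for -u'' = f on n interior points, h = 1/(n + 1).
        h = 1.0 / (n + 1)
        return (2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)) / h**2

    def prolongation(nc):
        # Linear interpolation from nc coarse points to 2 * nc + 1 fine points.
        P = np.zeros((2 * nc + 1, nc))
        for j in range(nc):
            P[2 * j + 1, j] = 1.0
            P[2 * j, j] += 0.5
            P[2 * j + 2, j] += 0.5
        return P

    def setup(levels, n0=3):
        # Grid hierarchy n0, 2*n0+1, ...; Ps[l] interpolates level l -> finest.
        ns = [n0]
        for _ in range(levels - 1):
            ns.append(2 * ns[-1] + 1)
        Ps = []
        for l in range(levels):
            P = np.eye(ns[l])
            for m in range(l, levels - 1):
                P = prolongation(ns[m]) @ P
            Ps.append(P)
        A = poisson_1d(ns[-1])
        diags = [np.diag(P.T @ A @ P) for P in Ps]  # Galerkin diagonal per level
        return A, Ps, diags

    def additive_mg(r, Ps, diags):
        # Additive multigrid: Jacobi-smooth the residual on EVERY level at the
        # same time and sum the interpolated corrections.  Divergent as a
        # stand-alone iteration, but a valid SPD preconditioner for CG.
        return sum(P @ ((P.T @ r) / D) for P, D in zip(Ps, diags))

    def pcg(A, b, Ps, diags, tol=1e-10, maxit=200):
        # Conjugate gradient iteration with the additive-MG preconditioner.
        x = np.zeros_like(b)
        r = b - A @ x
        z = additive_mg(r, Ps, diags)
        p, rz = z.copy(), r @ z
        for _ in range(maxit):
            Ap = A @ p
            alpha = rz / (p @ Ap)
            x += alpha * p
            r -= alpha * Ap
            if np.linalg.norm(r) <= tol * np.linalg.norm(b):
                break
            z = additive_mg(r, Ps, diags)
            rz_new = r @ z
            p = z + (rz_new / rz) * p
            rz = rz_new
        return x

    # Usage: 4 levels (finest grid of 31 points), right-hand side f = 1.
    A, Ps, diags = setup(4)
    x = pcg(A, np.ones(A.shape[0]), Ps, diags)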

  19. Numerical results • Consider a 2D Poisson equation with Neumann boundary conditions on all four boundaries of the computational domain. • The parallel efficiency is tested on a cluster of computers in SHARCNET.

  20. Numerical results • Uniform grids: using more processors does not always reduce the run time. For the cases with refinement levels less than 8, the times increase from 16 to 32 processors because the communication time dominates. • As the problem size grows, the parallel efficiency improves because the computation time dominates. • For the largest case, a parallel efficiency of 98% is achieved with 64 processors. The wall-clock times on regular grids from level 5 to 10 with up to 64 processors

  21. Numerical results • AMR grids: the leaf cells are refined where the refinement indicator is larger than its mean value. • For problems with large grid sizes, the times decrease monotonically as the number of processors increases. • For the largest case, a parallel efficiency of 106% (greater than 100%) is achieved because the cache memory is used more efficiently as the grid size on each processor becomes smaller. The wall-clock times on AMR grids with up to 64 processors

  22. Numerical results • The grid-partitioning and mapping times using the Hilbert SFC: the percentage increases slightly when a larger number of processors is used, because a large amount of data has to be migrated across more processors. • Even so, the ratio of the load-balancing time to the total computational time is only 0.22% in the case of 64 processors. • The proposed load-balancing method is therefore very efficient. The wall-clock times associated with the load-balancing procedure for an adaptive grid on different numbers of processors.

  23. Conclusions • The FTT data structure is used to organize the adaptive meshes because of its low memory overhead and its ability to access neighboring cells without searching. • The Hilbert SFC approach is used to dynamically partition the adaptive meshes. • The numerical experiments show that the proposed parallel Poisson solver is highly efficient.
