High performance computing for a family of smooth trajectories using parallel environments

High performance computing for a family of smooth trajectories using parallel environments Bologna, March 23 - 26, 2004 Gianluca Argentini Advanced Computing Laboratory gianluca.argentini@riellogroup.com

Introduction - 1 • The company: • products for heating and conditioning • development and production of residential and industrial burners • presence of a Center of Excellence for the study of combustion and flame processes • R&D Department, extensive CAD (Catia from IBM-Dassault Systemes) and FLUENT computations 1

Introduction - 2 • Industrial and power burners have particular requirements: • customized study of combustion head • study of accurate geometry of combustion chamber (shape of the flame, flow of gas or oil and oxygen) • ventilation and air circulation fans for a correct oxygen supply, right pressurization and continous cooling • reduction of vibrations and noise 2

Introduction - 3 • Rapid prototyping for optimal shape of combustion head and combustion chamber involves Computational Fluid Dynamics: • tracing of air or gas flow particles streamlines • shape of the flow in a generic geometry • High graphic resolution requires a large amount of particles paths: • strong computational memory-expensive and cpu-based effort • distribution of paths on a multiprocessor environment 3

The problem Focus on numerical simulation of flows (in combustion head, chamber or in fans mechanism) The large numerical output of simulation is generated by Navier-Stokes (use of FLUENT package) or Cellular Automaton models (MATLAB package) • From data, we would obtain: • path tracking of fluid particles, useful for customized design of combustion heads and chambers • smooth 3D visualization of particles trajectories, possibly with continuous slope and curvature (analitically: class C2) 4

About problem treatment • Step 1. The data obtained from simulation model are treated by an algorithm for the computation of algebric curves (cubic polynomials) associated to particles paths: • block-data distribution for parallel computing • necessity of continuous reallocation in RAM • Step 2. Evaluation of polynomials on a large set of values for fine resolution: • very expensive CPU computation • sets of curves distribution on processors, no communication Data Algebric curves Massive Computing 5

Fitting the trajectories From simulation, a single particle trajectory is a set of 3D points: • S is the number of points • M is the number of trajectories Interpolation of the points: • Bezier-like is not realistic in case of twist or divergence of speeds field • Chebychev or Least-Squares-like are too rigid for a customized application • polinomial fitting is simple but often shows spurious effects as Runge-Gibbs phenomenon We think a splines-based technique is more useful 6

The splines-based algorithm Let S = 4 x N : path is divided into four-points groups For every group the points are interpolated by three cubic polynomials imposing four analytical conditions: • passage at Pk point, 1 £ k £ 3 • passage at Pk+1 point • continuous slope at Pk point • continuous curvature at Pk point For smooth rendering and for avoiding excessive twisting of trajectories, the cubics uk are added to the Bezier curve b associated to the four points: v = ab + buk 0 < a, b < 1 7

Finding the splines We have choosea = b = 0.5 Let b = As3 + Bs2 + Cs + D (0 £ s £ 1) the Bezier curve of control points P1,…,P4; for every spline uk = at3 + bt2 + ct + d (0 £ t £ 1) the coefficients are computed by (2 £ k £ 3, for k = 1 the formulas are slightly different but of the same algebraic form; a, b, c, d are 3-dimensional cartesian vector) a = Pk+1 - Pk - 3B - C - 6 b = B + 3 (1) c = 2B + C + 3 d = Pk 8

A matrix for splines The system (1) can be represented asc = T b (matrix-vector multiplication) where c = (a, b, c, d) b = (Pk+1, Pk, B, C, 1) 1 -1 -3 -1 -6 T = 0 0 1 0 3 0 0 2 1 3 0 1 0 0 0 For every spline, only the vector b is variable; for a single trajectory, it must be reassigned in RAM every group of two points, after the computation of the relative Bezier curve. 9

A global matrix for splines If we define a global matrixÆas T 0 . . . 0 with0as 4 x 5 zero-matrix, we have a 4M x 5Msparse matrix (optimization of memory storage in MATLAB) 0 T . . . 0 Æ= . . 0 0 . . 0 T and with B = (Pk+1, Pk, B1, C1, 1, . . ., Pk+1, Pk, BM, CM, 1) we can compute for every two-points group the coefficients of cubic splines for all the M trajectories: C = Æ B 10

Computational complexity analysis • Every four-points group, for the M trajectories the flops (floating point operations) number for computing the splines coefficients is: • for Bezier curves (customized Matlab script): 316M • for Æ matrix-vector multiplication (upper estimate): 324M • We have N groups of four-points at every trajectory: the total flops number of the Step 1is about 640MN 11

A parallel distribution for splines With P, number of processes, divisor of M,the method used is the distribution of M/P trajectories (rows of Æ matrix) to every process; no communication is involved. The value of M is important for the occupation of RAM at every computational node. M pP . . p2 p1 N linear execution for every process 12

Computing splines: hardware and software • Bezier curves and splines computation on • Linux cluster IBM x330, biprocessor Pentium III 1.133 GHz, at CINECA (2003); C routines and MPI (for parallel startup and data distribution) • 2 nodes Windows2000 / Linux RedHat IBM x440, biprocessor Xeon 2.4 GHz Hyper Threading, 2 GB RAM, at Riello (2003); MATLAB rel. 6.5 scripts (startup of simultaneous multi-engine) 13

Computing splines: performance results Beowulf CINECA: The registered speedup is quasi-linear; for high value of P the amount of data distribution (M variable) among processes is more intrusive. X440 cluster: Better performances of Win2k (linear speedup) - compared with Linux - with Intel HT technology 14

Post-processing for splines • Now we would a fast method for computing the splines values in a set of parameter ticks with fine sampling. • The CFD packages have some limits in the post-processing phase: • resolution based on pre-processing mesh • rigid (when possible) load distribution among available processors For good graphic visualization, the interval between two data-points might be divided in a suitable number of ticks: 15

Valuating the splines Let V + 1 the number of ticks for each cubic spline valuation; then the ticks are (0, 1/ V, 2/ V, . . ., (V -1)/ V , 1) and the values of splines parameter in the computation are their (0, 1, 2, 3)-th degree powers. The value of a cubic at t0can be view as a dot product: at03+ bt02+ ct0 + d = (a, b, c, d)·(t03, t02, t0, 1) 0 (1/ V)3 . . . . ((V -1)/ V)3 1 LetÓthe pre-allocable constant 4x(V+1) matrix: 0 (1/ V)2 . . . . ((V -1)/ V)2 1 0 (1/ V)1 . . . . ((V -1)/ V)1 1 1 1 . . . . 1 1 16

An eulerian view LetÂthe M x 4 matrix (each row a spline for each trajectory): a1 b1 c1 d1 a2 b2 c2 d2 . . . . aM bM cM dM Then the Mx (V+1) matrix productÕ= Â Ócontains in each row the values of a cubic between two data-points, for all the M trajectories (eulerian method). For the product, the flops are 21M(V+1), the number of matrices Õ is 3N; the total number of flops are 63NM(V+1). 17

A lagrangian view LetÏthe 3N x 4 matrix (each row a spline along one single trajectory): a1 b1 c1 d1 a2 b2 c2 d2 . . . . a3N b3N c3N d3N Then the 3Nx (V+1) matrix productÔ= Ï Ócontains in each row the values of a cubic between two data-points, for a single trajectory (lagrangian method). For the product, the flops are 63N(V+1), the number of matrices Ô is M; the total number of flops are 63NM(V+1). 18

Data distribution: eulerian case With P, number of processes, divisor of 3N (amount of two-points groups),the method used is the distribution of 3N/P Â matrices to every process; no communication is involved. The value of N is important for the total computation time, N and M for the RAM allocation of each process. 3N CPU . . . . . M RAM 19

Data distribution: lagrangian case With P, number of processes, divisor of M (amount of trajectories),the method used is the distribution of M/PÏ matrices to every process; no communication is involved. The value of N is important for the total computation time, N and M for the RAM allocation of each process. 3N RAM . M CPU 20

Hardware and software Hardware: 2 x { IBM x440, 2 Xeon 2.4 GHz HT, 2 GB }, at Riello Software: Windows2000 / Linux RH 8.1, MATLAB 6.5, parallelism of simultaneous Matlab engines • for matrix multiplication, Matlab 6.5 uses internal LAPACK Level 3 BLAS routines (good performances) • the Ó matrix is computed only one time (in case of uniform and costant sampling interval), its values are probably always cached during matrices multiplication 21

Performance results Performances of a single Matlab process for the Â Óproduct with V = 100; as theory, the execution time is linear on M variable. Performances of multiprocess products (case 3N = 4200P); for P £ 8, the total computation time depends on NM (Gustafson law), as expected. 22

Performance results: considerations • Linear speedup until P=8 (= number of virtual Hyper Threads processors); for P³8 reallocations of RAM and caches have a negative effect • For large data sets, the amount of RAM in the nodes of cluster is a critical factor, while the CPUs performances are good with the use of LAPACK routines • First results with a technique using “global M-N” matrices, an MPI-multithreads version of MATLAB (Cornell Toolbox), and parallel matrix multiplication algorithms, show an overhead, in case of large data, due to communications 23

Performance results: Hyper Threading Performance of Intel Hyper Threading Technology of Xeon processors; the vertical unit is time execution in the case of 8 processes (M=5000,3N = 4200P); until 8, the time seems to be quadratic on processes number. • Similar results have been obtained • using Win2k or Linux • using High Performance Linpack benchmarking 24

Examples red = trajectory computation with V = 100; black = least squares method, 3° degree polynomials; gray = data-points from simulation Forced injection of air in combustion head; the ribbons show some particles trajectories; data-points from simulation, paths computation with V=100, M=5000, N=1600, P=8; computation and rendering by Matlab; total computation time 85 secs Thanks

High performance computing for a family of smooth trajectories using parallel environments