
PERI Tiger Teams FY07 Report




  1. PERI Tiger Teams FY07 Report Performance Engineering Research Institute October 30, 2007 Contact: Bronis R. de Supinski bronis@llnl.gov

  2. Tiger Team Process and Milestones • Process from Section 4.3 of Proposal: • Select one or two applications per year • Consult with Office of Science Program Managers • Consist of three to four PERI researchers • Milestones • Q1: Identify applications and teams for current year • Q1: Report on prior year’s teams • Q3: Report progress; reassign as per DOE needs

  3. FY07 Selection Process • Delayed to allow completion of application survey • Received guidance to focus on 3 JOULE metric codes • S3D, GTC and Chimera • Initial discussions w/JOULE Metric coordinator K. Roche • SciDAC PI meeting in Atlanta in January of 2007 • Strong interest from both S3D and GTC • Chimera expressed concerns over their staffing and needs • Narrowed to focus on S3D and GTC in early March

  4. FY07 Tiger Team Formation • Solicited interest in participating on teams • Assignments made by PERI management based on: • Perceived code team needs • Prior engagement activities • Balance of expertise • Participants from six of nine PERI institutions • Also strong participation in both teams by Univ. of Oregon • Coordination • Team-specific mailing lists • Regular telecons

  5. S3D Tiger Team • Team Lead: Bronis de Supinski (LLNL) • PERI Team Members • John Mellor-Crummey, Mike Fagan (Rice) • Nick Wright, Allan Snavely (SDSC) • David Bailey (LBNL) • Rich Vuduc (LLNL) • Affiliate Team Members • Sameer Shende, Alan Morris, Allen Malony, Kevin Huck (Oregon) • Jeff Larkin (Cray/ORNL) • Application Team Participants • Jackie Chen, David Lignell (SNL) • Facilitators • Kenny Roche, Pat Worley (ORNL)

  6. S3D: Direct numerical simulation (DNS) of turbulent combustion • State-of-the-art code developed at CRF/Sandia • 2007 INCITE award: 6M hours on XT3/4 at NCCS • Tier 1 pioneering application for 250TF system • Why DNS? • Study micro-physics of turbulent reacting flows • Full access to time-resolved fields • Physical insight into chemistry-turbulence interactions • Develop & validate reduced model descriptions used in macro-scale simulations of engineering-level systems [Figure: modeling hierarchy from DNS through physical models to engineering CFD codes (RANS, LES)] Text and figures courtesy of S3D PI, Jacqueline H. Chen, SNL

  7. S3D - DNS Solver • Solves compressible reacting Navier-Stokes equations • High fidelity numerical methods • 8th order finite-difference • 4th order explicit RK integrator • Hierarchy of molecular transport models • Detailed chemistry • Multiphysics (sprays, radiation & soot) • From SciDAC-TSTC (Terascale Simulation of Combustion) Text and figures courtesy of S3D PI, Jacqueline H. Chen, SNL

  8. S3D Parallelization: Fortran 90 + MPI • 3D domain decomposition • each MPI process manages part of the domain • All processes have same number of grid points & same computational load • Inter-processor communication only between nearest neighbors in 3D mesh • large messages; non-blocking sends & receives • All-to-all communication only required for monitoring & synchronization ahead of I/O [Figure: S3D logical topology] Text courtesy of S3D PI, Jacqueline H. Chen, SNL
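  The nearest-neighbor exchange described on the slide above can be illustrated with a minimal non-blocking halo exchange. The sketch below is illustrative only; the array names, sizes, and single exchange direction are assumptions, not S3D source.

      program halo_sketch
        use mpi
        implicit none
        integer, parameter :: nx = 32, ny = 32, nz = 32
        integer :: ierr, nprocs, comm3d, left, right
        integer :: dims(3), req(4), stats(MPI_STATUS_SIZE,4)
        logical :: periods(3), reorder
        real(8) :: u(0:nx+1, ny, nz)                   ! one ghost plane on each x face
        real(8) :: sendl(ny,nz), sendr(ny,nz), recvl(ny,nz), recvr(ny,nz)

        call MPI_Init(ierr)
        call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

        ! Build a periodic 3D process grid and find the x-direction neighbors.
        dims = 0; periods = .true.; reorder = .true.
        call MPI_Dims_create(nprocs, 3, dims, ierr)
        call MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periods, reorder, comm3d, ierr)
        call MPI_Cart_shift(comm3d, 0, 1, left, right, ierr)

        u = 1.0d0
        sendl = u(1,  :, :)          ! pack boundary faces into contiguous buffers
        sendr = u(nx, :, :)

        ! Post non-blocking receives and sends; real codes overlap work here.
        call MPI_Irecv(recvl, ny*nz, MPI_DOUBLE_PRECISION, left,  0, comm3d, req(1), ierr)
        call MPI_Irecv(recvr, ny*nz, MPI_DOUBLE_PRECISION, right, 1, comm3d, req(2), ierr)
        call MPI_Isend(sendr, ny*nz, MPI_DOUBLE_PRECISION, right, 0, comm3d, req(3), ierr)
        call MPI_Isend(sendl, ny*nz, MPI_DOUBLE_PRECISION, left,  1, comm3d, req(4), ierr)
        call MPI_Waitall(4, req, stats, ierr)

        u(0,    :, :) = recvl        ! unpack ghost planes
        u(nx+1, :, :) = recvr

        call MPI_Finalize(ierr)
      end program halo_sketch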

  9. A Performance Mystery in S3D on PWR4 (SDSC) • The following line of code (and many similar others) has ~70% L1 hit rate: diffFlux(:,:,:,n,m) = - Ds_mixavg(:,:,:,n) * ( grad_Ys(:,:,:,n,m) + Ys(:,:,:,n) * grad_mixMW(:,:,:,m) ) • Hardware counter measurements: Total L2 data cache accesses: 9784.594 M; % accesses from L2 per cycle: 5.112%; L2 traffic: 1194408.401 MBytes; L2 bandwidth per processor: 9183.869 MBytes/sec; Total load and store operations: 33073.374 M; Number of loads per load miss: 30.527; Number of stores per store miss: 1.014; Number of load/stores per D1 miss: 3.380; L1 cache hit rate: 70.415% • A performance model provides an expectation of 90%...

  10. Discrepancy Understood, Performance Optimized • diffFlux is defined as a pointer: “diffFlux => grad_Ys” • Compiler unrolls the loop suboptimally • Loops over the 2nd index instead of the 1st • i.e., it accesses memory in “nx-size” strides • Alias analysis not sufficient to allow the “obvious” optimization • Simple fix on IBM systems • Use the “-qalias=noaryovrlp” compiler flag • Runtime on 8 PWR4+ 1.5 GHz CPUs, 200 timesteps • 2949 s (before), 2728 s (after) • 7.5% improvement, and L1 hit rates are now what they should be • Same loops show expected ~93% L1 hit rate on XT3/4
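  A minimal sketch of the aliasing pattern described above, with illustrative array names and sizes (this is not the S3D source): because diffFlux is a Fortran pointer associated with grad_Ys, the compiler cannot prove that the left- and right-hand sides of the array assignment are independent and falls back to a conservative loop order. On IBM XL Fortran, -qalias=noaryovrlp asserts that array assignments do not overlap, restoring the stride-1 order.

      program alias_sketch
        implicit none
        integer, parameter :: nx = 64, ny = 64, nz = 64, ns = 4
        real, target,  allocatable :: grad_Ys(:,:,:,:,:)
        real, pointer              :: diffFlux(:,:,:,:,:)
        real, allocatable :: Ds_mixavg(:,:,:,:), Ys(:,:,:,:), grad_mixMW(:,:,:,:)
        integer :: n, m

        allocate(grad_Ys(nx,ny,nz,ns,3))
        allocate(Ds_mixavg(nx,ny,nz,ns), Ys(nx,ny,nz,ns), grad_mixMW(nx,ny,nz,3))
        Ds_mixavg = 1.0; Ys = 1.0; grad_mixMW = 1.0; grad_Ys = 1.0

        diffFlux => grad_Ys     ! the pointer association that defeats alias analysis

        do m = 1, 3
          do n = 1, ns - 1
            ! LHS and RHS refer to potentially overlapping storage, so the
            ! compiler cannot freely choose the fastest traversal order.
            diffFlux(:,:,:,n,m) = - Ds_mixavg(:,:,:,n) * ( grad_Ys(:,:,:,n,m) &
                                  + Ys(:,:,:,n) * grad_mixMW(:,:,:,m) )
          end do
        end do
        print *, 'done', diffFlux(1,1,1,1,1)
      end program alias_sketch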

  11. Vectorizing exp for S3D (SDSC) • Substantial time in exp in the getrates routine • Power4+ profiles • Code examination revealed calls were not vectorizable • Perl script transformed from 0% to 50% vectorized • Substantial Power4+ performance improvement • 30% for getrates routine • Approximately 10% overall • Smaller performance improvement on XT4 • Approximately 10% for getrates routine • Approximately 1.5% overall • Subject of continuing tuning effort (D. Bailey)
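  A hedged before/after sketch of the kind of transformation described above (not the actual Perl-generated getrates code): scalar exp calls inside a loop are replaced by a whole-array exp, exposing the vectorization opportunity to the compiler or a vendor vector math library.

      program exp_vectorize_sketch
        implicit none
        integer, parameter :: n = 1000000
        real(8) :: arg(n), rate_scalar(n), rate_vector(n)
        integer :: i

        call random_number(arg)

        ! Before: scalar exp call inside the loop, which the Power4+ compiler
        ! turned into non-vectorizable scalar library calls.
        do i = 1, n
          rate_scalar(i) = exp(-arg(i))
        end do

        ! After: apply exp to the whole argument array at once; the elemental
        ! array form is straightforward for the compiler to vectorize.
        rate_vector = exp(-arg)

        print *, maxval(abs(rate_scalar - rate_vector))
      end program exp_vectorize_sketch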

  12. S3D Performance at the Loop Level (Rice) • Overall performance (15% of peak): 2.05 × 10^11 FLOPs / 6.73 × 10^11 cycles = 0.305 FLOPs/cycle • Wasted opportunity = (maximum FLOP rate × cycles - actual FLOPs) / total waste • The highlighted loop accounts for 11.4% of total program waste [Figure: loop-level performance profile]

  13. S3D: What Opportunities Exist? • 5D loop nest: 2D explicit loops over 3D F90 vector syntax [Figure: annotated loop nest; labels: initialize, update, reuse, data streams in/out of memory, performance problem]

  14. Apply LoopTool to S3D Diffusive Flux Loop • Annotations: unroll-and-jam directives, unswitching directives, controlled fusion directives

      !dir$ uj 3
      do m=1,3                    ! DIRECTION
      !dir$ uj 2
        do n=1,n_spec-1           ! SPECIES
      !dir$ unswitch 2
          if (baro_switch) then
            ! driving force includes gradient in mole fraction and baro-diffusion:
      !dir$ fuse 1 1 1
            diffFlux(:,:,:,n,m) = - Ds_mixavg(:,:,:,n) * ( grad_Ys(:,:,:,n,m) &
                 + Ys(:,:,:,n) * ( grad_mixMW(:,:,:,m) &
                 + (1 - molwt(n)*avmolwt) * grad_P(:,:,:,m)/Press ) )
          else
            ! driving force is just the gradient in mole fraction:
      !dir$ fuse 1 1 1
            diffFlux(:,:,:,n,m) = - Ds_mixavg(:,:,:,n) * ( grad_Ys(:,:,:,n,m) &
                 + Ys(:,:,:,n) * grad_mixMW(:,:,:,m) )
          endif
          ! Add thermal diffusion:
      !dir$ unswitch 2
          if (thermDiff_switch) then
      !dir$ fuse 1 1 1
            diffFlux(:,:,:,n,m) = diffFlux(:,:,:,n,m) - Ds_mixavg(:,:,:,n) * &
                 Rs_therm_diff(:,:,:,n) * molwt(n) * avmolwt * grad_T(:,:,:,m) / Temp
          endif
          ! compute contribution to nth species diffusive flux
          ! this will ensure that the sum of the diffusive fluxes is zero.
      !dir$ fuse 1 1 1
          diffFlux(:,:,:,n_spec,m) = diffFlux(:,:,:,n_spec,m) - diffFlux(:,:,:,n,m)
        enddo  ! SPECIES
      enddo    ! DIRECTION

  15. Optimization of S3D Diffusive Flux Loop • 2.94x faster than original (6.7% total savings) • Transformation log: • Scalarization (4 stmts) • Loop unswitching (2 conditions) • Fusion (loops within 4 outer nests) • Unroll-and-jam (2 loops) • Peeling excess iterations (4 nests) [Figure: loop-nest control structure before and after LoopTool; labels include m=1,3; n=1,nspec-1; if BS / if TD branches; unswitched n=1,nspec-2,2 variants; 445 lines; 35 lines]
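  For readers unfamiliar with unroll-and-jam, the generic sketch below (hypothetical arrays a, b, and c, not S3D code) shows the transformation on a simple two-level nest: the outer loop is unrolled and the copies are jammed into one inner loop body so that loads are shared across iterations.

      program unroll_and_jam_sketch
        implicit none
        integer, parameter :: n = 512, m = 512
        real(8) :: a(n,m), b(n), c(m)
        integer :: i, j

        call random_number(b); call random_number(c)

        ! Original nest: b(i) is reloaded from memory for every j iteration.
        a = 0.0d0
        do j = 1, m
          do i = 1, n
            a(i,j) = a(i,j) + b(i) * c(j)
          end do
        end do

        ! Same computation after unroll-and-jam of the j loop by 2 (m is even
        ! here, so no remainder loop); the two jammed statements share b(i).
        a = 0.0d0
        do j = 1, m - 1, 2
          do i = 1, n
            a(i,j)   = a(i,j)   + b(i) * c(j)
            a(i,j+1) = a(i,j+1) + b(i) * c(j+1)
          end do
        end do

        print *, sum(a)
      end program unroll_and_jam_sketch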

  16. S3D: An Unexpected Bottleneck • An implicit loop that copies a non-contiguous 4D slice of 5D data to contiguous storage accounted for 5.4% of time • Adjusted routine interfaces to avoid the copy: 100% faster
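  A hedged sketch of the interface change described above, using hypothetical routines rather than the S3D source: passing a non-contiguous array slice to an explicit-shape dummy argument forces copy-in/copy-out at every call, while passing the full array plus the fixed index lets the routine work in place.

      module slice_sketch
        implicit none
      contains
        subroutine work_on_copy(s, nx, ny, nz, nm)   ! non-contiguous slice is copied in/out
          integer, intent(in) :: nx, ny, nz, nm
          real(8), intent(inout) :: s(nx, ny, nz, nm)
          s = s + 1.0d0
        end subroutine work_on_copy

        subroutine work_in_place(q, nx, ny, nz, ns, nm, n_fixed)  ! no copy needed
          integer, intent(in) :: nx, ny, nz, ns, nm, n_fixed
          real(8), intent(inout) :: q(nx, ny, nz, ns, nm)
          q(:, :, :, n_fixed, :) = q(:, :, :, n_fixed, :) + 1.0d0
        end subroutine work_in_place
      end module slice_sketch

      program avoid_copy_sketch
        use slice_sketch
        implicit none
        integer, parameter :: nx=32, ny=32, nz=32, ns=9, nm=3
        real(8) :: q(nx, ny, nz, ns, nm)
        q = 0.0d0
        call work_on_copy(q(:,:,:,1,:), nx, ny, nz, nm)   ! slice copied to contiguous storage
        call work_in_place(q, nx, ny, nz, ns, nm, 1)      ! operates on q directly
        print *, sum(q)
      end program avoid_copy_sketch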

  17. S3D Node Performance Tuning Summary Achieved ~12.7% overall improvement • Node performance increased from 15% of peak to 17.4% • Estimated savings for 2M CPU hour run: 254K CPU hours • More opportunities remain • Register reuse and tiling of stencil computations • Inlining + fusion + array contraction of temporary variables • Further improvements require more changes • Lots of potential smaller improvements • Enabling technologies contributions • HPCToolkit enabled identifying and assessing bottlenecks • LoopTool helped automate tedious code transformations

  18. S3D Scaling Performance (App Team)

  19. S3D Scaling Study (Oregon) • Harness test case • Platform: Jaguar Combined Cray XT3/XT4 at ORNL • Several runs to identify scaling trends • Focus on 6400p • Evaluate impact of combined XT3/XT4 nodes • Performance evaluation of MPI_Wait • Study mapping of MPI ranks to nodes

  20. Total Runtime Breakdown by Events - Time [Figure: per-event time breakdown; highlighted events: WRITE_SAVEFILE* and MPI_Wait] *Recent analysis indicates WRITE_SAVEFILE is not a scaling issue

  21. TAU: ParaProf Profile • MPI_Wait times exhibit two equivalence classes • Same equivalence classes also seen in memory-bandwidth-intensive computation routines

  22. S3D Scaling Study Conclusion • Determined that XT3 nodes slowed certain S3D routines • Consistent across all XT3 nodes • Memory bandwidth limited routines • Suggested load balancing optimization • Reduce grid size in one dimension for XT3 nodes • Not yet implemented due to concerns over long term relevance • Provided estimate of benefit for combined XT3/XT4 runs • Many scaling and single node results appear in S3D IOP paper

  23. S3D Modeling Results & Future Directions • PMaC predictions for S3D on XT3 and XT4 • Currently within 15% for an 8 CPU run • Extending to larger CPU counts • Working on improving accuracy • What is the expected performance of S3D on ORNL’s 250 TFLOP machine? • Will our optimizations benefit quad-core system? • Different cache structure • L2 1MB→512KB • L3 0 → 2MB shared • What architecture will S3D perform best on?

  24. GTC Tiger Team • Team Lead: Shirley Moore (UTK) • PERI Team Members • Haihang You (UTK) • John Mellor-Crummey, Gabriel Marin, Guohua Jin (Rice) • Hongzhang Shan (LBNL) • Affiliate Team Members • Kevin Huck (UOregon) • Ed D’Azevedo (ORNL) • Lenny Oliker (LBNL) • Application Team Participants • Stephane Ethier, Weixing Wang, Wei-li Lee (PPPL) • Scott Klasky (ORNL) • Facilitators • Kenny Roche, Pat Worley (ORNL) • Bronis de Supinski (LLNL)

  25. GTC: Gyrokinetic Toroidal Code from PPPL • Particle in Cell (PIC) code with gyrokinetic simulation • GTC-S: “shaped” code • More realistically represents experimentally relevant geometry • GTC-P is a new “petascale” version • Partitions the poloidal plane into radial shells • Fortran 90 and MPI, PETSc used for Poisson solves • Currently no OpenMP in GTC-S or GTC-P • OpenMP may be considered for multicore • Code team science goals • Impact of turbulent transport in burning plasma fusion devices • Integrated simulations for ITER plasmas for a range of temporal and spatial scales

  26. The Gyrokinetic Toroidal Code • 3D particle-in-cell code to study microturbulence in magnetically confined fusion plasmas • Solves the gyro-averaged Vlasov equation • Gyrokinetic Poisson equation solved in real space • Low-noise δf method • Global code (full torus as opposed to only a flux tube) • Massively parallel: typical runs use 1024+ processors • Electrostatic (for now…) • Nonlinear and fully self-consistent • Written in Fortran 90/95 • Originally optimized for superscalar processors

  27. Particle-in-Cell (PIC) Method • Particles sample distribution function. • The particles interact via a grid, on which the potential is calculated from deposited charges. • The PIC Steps • “SCATTER”, or deposit, charges on the grid (nearest neighbors) • Solve Poisson equation • “GATHER” forces on each particle from potential • Move particles (PUSH) • Repeat…
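  A schematic one-step sketch of the PIC cycle listed above, for a 1D periodic grid. This is illustrative only, not GTC; the Poisson solve is left as a stub comment because any real solver would dominate the example.

      program pic_step_sketch
        implicit none
        integer, parameter :: ng = 64, np = 1000
        real(8), parameter :: lbox = 1.0d0, dx = lbox/ng, dt = 1.0d-3, qm = -1.0d0
        real(8) :: x(np), v(np), rho(0:ng-1), efield(0:ng-1), epart
        integer :: p, i

        call random_number(x); x = x * lbox
        v = 0.0d0; efield = 0.0d0            ! field stays zero: solver is stubbed below

        ! 1. SCATTER: deposit each particle's charge onto its two nearest grid points
        rho = 0.0d0
        do p = 1, np
          i = int(x(p)/dx)
          rho(i)            = rho(i)            + (1.0d0 - (x(p)/dx - i))
          rho(mod(i+1, ng)) = rho(mod(i+1, ng)) + (x(p)/dx - i)
        end do

        ! 2. SOLVE: Poisson equation for the field on the grid
        !    (stub; in GTC this is a gyrokinetic Poisson solve in real space)

        ! 3. GATHER + 4. PUSH: interpolate the field to each particle and advance it
        do p = 1, np
          i = int(x(p)/dx)
          epart = (1.0d0 - (x(p)/dx - i)) * efield(i) + (x(p)/dx - i) * efield(mod(i+1, ng))
          v(p) = v(p) + qm * epart * dt
          x(p) = modulo(x(p) + v(p) * dt, lbox)
        end do

        print *, 'charge deposited:', sum(rho), ' mean |v|:', sum(abs(v))/np
      end program pic_step_sketch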

  28. Charge Deposition Step (SCATTER operation) • Charge deposition for charged rings: 4-point average method, gyrokinetic (W.W. Lee) • Point-charge particles replaced by charged rings due to gyro-averaging [Figure: classic PIC deposition vs. GTC 4-point gyro-averaged deposition]

  29. Application Team’s Flagship Code: The Gyrokinetic Toroidal Code (GTC) • Fully global 3D particle-in-cell code (PIC) in toroidal geometry • Developed by Prof. Zhihong Lin (now at UC Irvine) • Used for non-linear gyrokinetic simulations of plasma microturbulence • Fully self-consistent • Uses magnetic field-line-following coordinates (ψ, θ, ζ) [Boozer, 1981] • Guiding center Hamiltonian [White and Chance, 1984] • Non-spectral Poisson solver [Lin and Lee, 1995] • Low numerical noise algorithm (δf method) • Full torus (global) simulation • Scales to a very large number of processors • Excellent theoretical tool!

  30. GTC Mesh and Geometry • Field-line-following coordinates (ψ, α, ζ) with α = θ - ζ/q • Saves a factor of about 100 in CPU time • Poloidal plane (cross-section): unstructured mesh [Figure: torus with θ and ζ directions labeled, and poloidal-plane mesh]

  31. New GTC Codes Use a New Parallel Model: Domain Decomposition + Particle Splitting • 1D domain decomposition: • Several MPI processes can now share a section of the torus • Particle splitting method • The particles in a toroidal section are equally divided between several MPI processes • Particles randomly distributed between processors within a toroidal domain • Pure MPI version • But OpenMP still there… • for multicore? [Figure: toroidal domain split across Processors 0-3]

  32. New Version (GTC-S) Inputs: Experimental Equilibrium and Profiles • Original GTC has flat temperature and density to set the scale for the gyroradius and the grid, and an analytical gradient for the turbulence drive • GTC-S uses experimental profiles and plasma boundary extracted from the experimental database by using the widely-used TRANSP tool (http://w3.pppl.gov/transp/) • The magnetic equilibrium is calculated from the profiles and boundary by using ESC or JSOLVER • Spline coefficients are calculated for the equilibrium and profiles to allow interpolations at the particle positions

  33. New Grid Follows Change in Gyro-radius with Temperature Profile • Local gyro-radius proportional to temperature • Evenly spaced radial grid in new r coordinate [defining formula not captured in transcript] [Figure: original GTC circular grid with flat temperature vs. new GTC-S grid following T(r)]

  34. Poloidal Component of B Field Taken into Account for Gyro-orbit • For large-aspect-ratio circular concentric cross-section, the difference between a poloidal plane and a gyro-plane is neglected. • A more accurate treatment is used here for general geometry. • Projection of gyro-plane on poloidal plane results in elliptic orbit. • 4-point average method uses ellipse

  35. GTC Performance Issues • Three basic operation types govern PIC performance • Grid work (i.e., Poisson solve) • Particle processing (e.g., position and velocity updates) • Interpolation between the above (i.e., charge deposition and field calculation in particle pushing) • Main GTC performance bottleneck is the charge deposition, or scatter operation • True of most PIC codes • More complex in GTC due to fast gyrating particles • Motion described by charged rings tracked by their guiding center

  36. More GTC Performance Issues • Some scaling issues with GTC-P relative to expectations • Time doubles when it should stay flat • Load imbalance in particle push routine apparently due to variation in TLB misses • 179% speedup going from single to dual core mode • Main computational kernels not memory bandwidth bound • Warning: as number of cores increases, other routines that are showing slowdown on dual core may start to dominate

  37. Status of GTC Tiger Team Effort • PERI Application Survey completed • Several conference calls w/application team participants • GTC and GTC-S code versions released to Tiger Team and Performance Database WG members on request • Awaiting release of GTC-P code to investigate: • Poor scaling • Load imbalance issues • Profiling of GTC-S carried out on Jaguar using TAU • Data accessible in password-protected PerfDMF database • Optimization of charge deposition by UTK • Detailed modeling, analysis, and optimization of GTC-S by Rice • Brief summary follows; details in a submitted paper

  38. TAU Profile Showing Weak Scaling of GTC-S on Jaguar

  39. Hand optimization of Charge Deposition (UTK) • Hand-tuning techniques • Common subexpression elimination • Code movement • Loop unrolling • Cache blocking • Improved performance of chargei by ~10% • Changes incorporated into GTC-S code • Written up as success story for Fred Johnson

  40. Modeling, Analysis, and Optimization of GTC at Rice • Detailed modeling of computation and memory hierarchy performance of GTC-S using the Rice modeling toolkit • Identified opportunities for data and loop transformations • Transformations improved program node performance 33% on Itanium2 and 13% on Opteron 275 • Changes sent to Stephane Ethier; awaiting response

  41. GTC-S Memory Hierarchy Performance - I • GTC-S suffers from poor spatial locality due to data layout • Model L3 cache miss counts for individual arrays at the loop level (values predicted for 64 radial grid points and 15 particles/cell) • L3 cache misses due to fragmentation of data in cache lines: 14.4% of total • Fragmentation of arrays zion (AKA particle_array) + zion0 accounts for: • 95% of all L3 fragmentation misses • 48% of all misses to the zion arrays • 13.7% of total L3 cache misses • particle_array is an alias to array zion used in gcmotion • Solution: transpose particle arrays zion and zion0 • transform arrays of structures into structures of arrays [Figure: total L3 miss counts by array]
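  A hedged sketch of the proposed transpose (attribute count and dimensions are illustrative, not the actual GTC declarations): with particles stored as zion(attribute, particle), a loop that touches one attribute strides through memory; after transposing to a structure-of-arrays layout, each attribute is a contiguous, unit-stride stream.

      program soa_transpose_sketch
        implicit none
        integer, parameter :: nattr = 6, mi = 200000
        real(8), allocatable :: zion(:,:), zion_t(:,:)
        real(8) :: shift
        integer :: p

        allocate(zion(nattr, mi))      ! "array of structures" layout
        allocate(zion_t(mi, nattr))    ! "structure of arrays" layout
        call random_number(zion)
        zion_t = transpose(zion)
        shift = 0.01d0

        ! Loop touching only attribute 3: strided accesses in the original layout...
        do p = 1, mi
          zion(3, p) = zion(3, p) + shift
        end do

        ! ...become unit-stride (and easily vectorized) after the transpose.
        do p = 1, mi
          zion_t(p, 3) = zion_t(p, 3) + shift
        end do

        print *, maxval(abs(zion(3,:) - zion_t(:,3)))
      end program soa_transpose_sketch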

  42. GTC-S Memory Hierarchy Performance - II • Understanding spatial and temporal data reuse patterns in GTC-S • Carried misses are non-compulsory misses (capacity + conflict misses) • Carrying scope is the innermost dynamic scope in which the data is reused (values predicted for 64 radial grid points and 15 particles/cell) • Two loops in main carry 40% of all L3 carried misses; these misses cannot be removed • 21.4% of misses are carried by the iterative loop of the Poisson solver; a recurrence in the solver prevents transformations • Focus on routines chargei and pushi • Fuse the two main loops in chargei • Apply tiling and fusion over several loop nests in pushi [Figure: program scopes carrying > 2% of L3 cache misses]

  43. GTC-S Memory Hierarchy Performance - III • Pinpointing and reducing TLB misses (values predicted for 64 radial grid points and 15 particles/cell) • The outer loop kz iterates over the inner dimension of phiflux • Interchange loop kz to the innermost position

      do kz=1,mzbig
        wz=real(kz)/real(mzbig)
        zdum=zetamin+deltaz*(real(k-1)+wz)
        do i=idiag1,idiag2
          ii=igrid(i)
          do j=1,mtdiag
            ...
            phiflux(kz+(k-1)*mzbig,j,i)= ...
          enddo
        enddo
      enddo

  • Additional transformations • Apply unroll & jam to increase ILP in routine spcpft • Transform arrays used in the Poisson solver to improve spatial locality
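  A hedged sketch of the interchange, with the elided loop body replaced by a simple assignment (not the actual GTC-S diagnostic code): once kz is innermost, consecutive iterations touch consecutive elements of phiflux's first dimension, which reduces TLB and cache misses.

      program interchange_sketch
        implicit none
        integer, parameter :: mzbig = 8, mtdiag = 64, nk = 4, ni = 16
        real(8) :: phiflux(mzbig*nk, mtdiag, ni)
        integer :: kz, i, j, k

        phiflux = 0.0d0
        k = 1
        do i = 1, ni
          do j = 1, mtdiag
            do kz = 1, mzbig                          ! kz moved to the innermost position
              phiflux(kz + (k-1)*mzbig, j, i) = real(kz + j + i, 8)
            end do
          end do
        end do
        print *, sum(phiflux)
      end program interchange_sketch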

  44. GTC-S Performance Improvements on Itanium2 • Percentages represent incremental improvements for each transformation • Results for 10 and 100 particles/cell • Side effect: big reduction in unnecessary data prefetches inserted by the Intel compiler • Itanium2 has a 16KB dedicated instruction cache; improvements in data locality are partly negated by an increase in instruction cache misses • Bigger impact expected with a larger instruction cache, e.g., Montecito

  45. GTC-S Performance Improvements on Opteron • Issues • Hardware prefetcher crucial for performance on Opteron • Prefetcher tracks up to 20 parallel data streams • Zion transpose increases # of parallel streams in key loops • Reduces effectiveness of hardware prefetcher • Data reuse improvements are negated by higher number of non-prefetched memory accesses • Approach • Reorganize five arrays in pushi as one array • Reorganize fourteen arrays in gcmotion as four arrays • Result: Improves execution time on Opteron by 13% • Reduces cache and TLB misses by > 50%
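  A hedged sketch of the array reorganization described above (hypothetical array names, not the pushi source): five separate per-particle arrays are merged into one interleaved array so that a loop reading all five values per particle generates a single contiguous stream instead of five competing hardware-prefetch streams.

      program merge_streams_sketch
        implicit none
        integer, parameter :: mi = 100000
        real(8), allocatable :: wp1(:), wp2(:), wp3(:), wp4(:), wp5(:)
        real(8), allocatable :: wp(:,:)          ! merged, interleaved layout: wp(field, particle)
        real(8) :: acc
        integer :: p

        allocate(wp1(mi), wp2(mi), wp3(mi), wp4(mi), wp5(mi), wp(5,mi))
        call random_number(wp1); call random_number(wp2); call random_number(wp3)
        call random_number(wp4); call random_number(wp5)
        wp(1,:) = wp1; wp(2,:) = wp2; wp(3,:) = wp3; wp(4,:) = wp4; wp(5,:) = wp5

        ! Before: five independent streams compete for prefetcher resources.
        acc = 0.0d0
        do p = 1, mi
          acc = acc + wp1(p) + wp2(p) + wp3(p) + wp4(p) + wp5(p)
        end do

        ! After: the five values for each particle are adjacent in memory,
        ! so the loop over particles reads one contiguous stream.
        acc = 0.0d0
        do p = 1, mi
          acc = acc + wp(1,p) + wp(2,p) + wp(3,p) + wp(4,p) + wp(5,p)
        end do
        print *, acc
      end program merge_streams_sketch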

  46. Exploring Run-time Data Reordering at Rice • Issue: performance degrades during GTC execution as particles become disordered w.r.t. the underlying tokamak grid • Preliminary study: particle reordering improves temporal locality during charge deposition and particle pushing • Currently developing an on-line feedback and control mechanism for particle reordering
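  A hedged sketch of one possible reordering scheme (a counting sort by grid cell index, with hypothetical arrays; the Rice mechanism may differ): after sorting, particles that interact with the same cell are adjacent in memory, which improves temporal locality in the scatter and push phases.

      program reorder_sketch
        implicit none
        integer, parameter :: np = 100000, ncell = 256
        real(8) :: x(np), x_sorted(np)
        integer :: cell(np), cnt(ncell), offset(ncell)
        integer :: p, c

        call random_number(x)
        cell = int(x * ncell) + 1            ! cell index of each particle, 1..ncell

        ! Counting sort: histogram, exclusive prefix sum, then permute particle data.
        cnt = 0
        do p = 1, np
          cnt(cell(p)) = cnt(cell(p)) + 1
        end do
        offset(1) = 0
        do c = 2, ncell
          offset(c) = offset(c-1) + cnt(c-1)
        end do
        do p = 1, np
          offset(cell(p)) = offset(cell(p)) + 1
          x_sorted(offset(cell(p))) = x(p)
        end do

        print *, 'first/last sorted positions:', x_sorted(1), x_sorted(np)
      end program reorder_sketch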

  47. What Worked • Close interactions with multiple members of app teams • Tiger Team specific mailing list for S3D • Generated team-wide comments, tapping more expertise • Not used very much for GTC • Large distributed teams • Somewhat surprising • Avoided duplication of measurement effort • University of Oregon participation as affiliate was exemplary • Rice participation was also exemplary • Publications and publicity • S3D science-focused IOP paper • SciDAC Review paper & SciDAC conf. presentation (Mellor-Crummey) • GTC success story for Fred Johnson

  48. What Didn’t • Timing of application selection • Not finalized until halfway through fiscal year • Delayed by survey • OK for first year; future implications? • Long Jaguar down time soon after teams formed • Initial understanding of code distribution • Provided through JOULE process, NOT direct from application teams • An appropriate distribution mechanism but unsettling to application team • Frequent, on-going concern of application teams • Will always start with application team in future, regardless of reason for selection or appropriateness of distribution • Mechanism for providing improvements back to application team • Slow and cumbersome; no CVS access • May not be solvable due to application team need for internal control • Addressed by repeated direct interactions

  49. FY08 Tiger Team Issues and Proposed Solutions • Which applications will be the focus of FY08 Tiger Teams? • Guidance from HQ requested • Recommend one XT4-focused team, one BG/P-focused team • Is JOULE precedent to continue? • Expect timing to be similar to FY07 (January/March), if maybe a little sooner • Plan to continue work with S3D and GTC during FY08 selection process • Solves late Q2 decision, one of FY07’s biggest issues • Suggest elimination of Q3 reassessment milestone in light of timing • What happens to teams from previous year? • Application tuning does not respect fiscal year boundaries • Good relationships established; don’t want to lose them • Are liaison activities sufficient to maintain them? • Plan to slowly devolve FY07 teams into very active liaison activities • Use different participants for FY08 teams in order to balance staffing requirements • How do we ensure that the results are publicized? • Initial S3D paper is good; potential for more • GTC success story is good; plan similar one for S3D • Continued interactions will support solving this question
