Tensor contraction engine & extensible many-electron theory module in NWChem
So Hirata, Pacific Northwest National Laboratory
MSS group meeting (24 Oct, 2002)
Collaborators & Sponsors
• M. Nooijen (Princeton University)
• R. J. Harrison & D. Bernholdt (Oak Ridge National Laboratory)
• D. Cociorva, G. Baumgartner, R. Pitzer, & P. Sadayappan (Ohio State University)
• J. Ramanujam (Louisiana State University)
• Office of Basic Energy Sciences, Department of Energy
• Office of Biological and Environmental Research, Department of Energy
Purpose of this project
• Create a high-level symbolic manipulation language that derives the working equations of second-quantized many-electron theories and implements them automatically
• Expedites complex and error-prone many-electron theory implementation
• Helps develop and examine new theories or approximations
• Facilitates parallelization and other laborious code optimizations
• CCSDT T3 code is >18000 lines in Fortran77!
Operator contraction engine (OCE)
• Object-oriented symbolic manipulation program that derives working equations from any well-defined second-quantized many-electron theory ansatz
• Performs valid contractions of normal-ordered operators according to Wick’s theorem and reduces any given ansatz into the simplest form of tensor contraction expressions
• Consolidates identical terms and recognizes terms that are related by permutation symmetry
Tensor contraction engine (TCE)
• Object-oriented symbolic manipulation program that analyzes tensor contraction expressions and implements them as efficient programs
• Breaks down multiple tensor contractions (A=BCDE) into a sequence of elementary tensor contractions (X=DE; Y=BX; A=YC) with minimal operation costs
• Factorizes common contractions [X=BC+BD into X=B(C+D)]
• Generates debug-level Fortran90 programs and release-level parallel Fortran77 programs
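The operation-cost-minimal breakdown described above can be illustrated with a small exhaustive search. This is only a sketch under stated assumptions, not TCE's actual algorithm: the function name `best_order`, the index labels, and the index ranges are hypothetical, the cost of a binary contraction is taken as the product of the ranges of all indices involved, and permutation symmetry is ignored.

```python
from itertools import combinations
from math import prod


def best_order(tensors, output, dim):
    """Exhaustively search binary contraction orders (hypothetical helper).

    tensors: dict mapping a tensor name to the frozenset of its index labels
    output:  set of index labels kept in the final result
    dim:     dict mapping each index label to its range
    Returns (total multiply-add count, list of steps).
    """
    def search(operands):
        if len(operands) == 1:
            return 0, []
        best = None
        for i, j in combinations(range(len(operands)), 2):
            (name_a, idx_a), (name_b, idx_b) = operands[i], operands[j]
            rest = [op for k, op in enumerate(operands) if k not in (i, j)]
            # An index survives if the output or a remaining tensor needs it.
            needed = set(output).union(*(idx for _, idx in rest))
            cost = prod(dim[x] for x in idx_a | idx_b)
            name = f"({name_a}{name_b})"
            result = frozenset((idx_a | idx_b) & needed)
            sub_cost, sub_steps = search(tuple(rest) + ((name, result),))
            if best is None or cost + sub_cost < best[0]:
                step = f"{name} = {name_a}*{name_b}"
                best = (cost + sub_cost, [step] + sub_steps)
        return best

    return search(tuple(tensors.items()))


# Assumed toy problem: A[i,j] * B[j,k] * C[k,l] -> result[i,l]
tensors = {"A": frozenset("ij"), "B": frozenset("jk"), "C": frozenset("kl")}
dim = {"i": 10, "j": 100, "k": 5, "l": 50}
cost, steps = best_order(tensors, output={"i", "l"}, dim=dim)
print(cost, steps)
```

For this toy problem, contracting A with B first costs 5000 + 2500 = 7500 multiply-adds, against 75000 for contracting B with C first. The search grows factorially with the number of tensors, so it is practical only for the short products that appear in individual terms.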
What is new?
• Full exploitation of index permutation symmetry
• Because they also take advantage of spin and spatial symmetry, the programs generated by TCE are, in theory, operation-cost minimal
• OCE extracts permutation symmetries among working equations
• TCE breaks down permutation operators into elementary permutation operators, analyzes which permutation symmetries can be exploited, and reflects the result in the generated code
Permutation symmetry
• Primitive tensors that appear in many-electron theories possess “permutation anti-symmetry.” For example, a two-electron integral tensor and a three-electron excitation amplitude tensor have the following properties:
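In a common notation (assumed here: p, q, r, s label general spin-orbitals, a, b, c particles, and i, j, k holes, with the anti-symmetrized form of the two-electron integrals), the anti-symmetry properties of these two tensors can be written as follows. An odd permutation within the upper or within the lower indices changes the sign; an even permutation does not.

```latex
v^{pq}_{rs} = -v^{qp}_{rs} = -v^{pq}_{sr} = v^{qp}_{sr}

t^{abc}_{ijk} = -t^{bac}_{ijk} = -t^{acb}_{ijk} = -t^{cba}_{ijk}
              = t^{bca}_{ijk} = t^{cab}_{ijk}
\quad \text{(and likewise for permutations of } i, j, k\text{)}
```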
Implication
• Reduced storage size
• Instead of storing the full tensor, we may keep only its non-redundant elements
• Reduced operation cost by shorter summation index ranges
• Reduced operation cost by shorter target index ranges
• Instead of computing the full tensor, we may obtain just its non-redundant elements
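The storage saving can be estimated by counting elements. The sketch below assumes that the non-redundant elements are those with strictly ordered particle indices and strictly ordered hole indices (a<b<c, i<j<k); the function names and the orbital counts are hypothetical.

```python
from math import comb


def full_size(n_hole, n_particle, rank):
    """Number of elements when every index runs over its whole range."""
    return (n_particle ** rank) * (n_hole ** rank)


def packed_size(n_hole, n_particle, rank):
    """Number of elements with strictly ordered particle and hole indices."""
    return comb(n_particle, rank) * comb(n_hole, rank)


# Assumed example: 10 hole and 40 particle spin-orbitals
n_hole, n_particle = 10, 40
for rank in (2, 3):
    full = full_size(n_hole, n_particle, rank)
    packed = packed_size(n_hole, n_particle, rank)
    print(rank, full, packed, round(full / packed, 1))
```

For the three-electron amplitudes in this example, the full tensor has 64,000,000 elements and the packed one 1,185,600, a reduction by a factor of about 54.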
Challenges
• What is the index permutation symmetry of an intermediate tensor?
• Consider the intermediate [equation]
• What is the best way to store just the non-redundant elements of tensors?
• What is the operation-cost-minimal contraction of two tensors with permutation symmetry?
• How can TCE generate code that exploits spin, spatial, and permutation symmetries at the same time?
Index permutation symmetry versus permutation symmetry of tensor contraction expressions
• Index permutation anti-symmetry ultimately reflects the Pauli principle of fermions; any tensor having electron indices (such as integrals or excitation amplitudes) is anti-symmetric
• When there is such a multiple tensor contraction as [equation], there “must” also be [equation]
Breakdown of permutation operators
• When breaking down a multiple tensor contraction into a sequence of binary tensor contractions, we should break down the permutation operators appropriately, so that each intermediate has the maximum index permutation symmetry
What is the best way to store an intermediate?
• An intermediate tensor has much more limited index permutation symmetries. Super (sub) indices are categorized into global targets and local targets, and permutation anti-symmetry exists among just global targets and among just local targets. So in general, the non-redundant elements are:
What is the general form of tensor contraction with permutation symmetry?
• Expansion
Note that an excitation amplitude tensor will not have local target indices. This is because two excitation amplitudes cannot contract with each other (both have particle super indices and hole sub indices).
What is the general form of tensor contraction with permutation symmetry?
• Contraction
Note that at least one of the two tensors is always an excitation amplitude tensor.
What is the general form of tensor contraction with permutation symmetry?
• Compression
Spin & spatial symmetry
• Spin symmetry
• Spatial symmetry
An example
    LOOP OVER b,j<=k BLOCKS
      LOOP OVER l,c,i BLOCKS
        LOOP OVER d BLOCKS
          IF (b<=d) READ t(b<=d,j<=k)
          IF (d<b) READ t(d<b,j<=k)
          READ v(l<c,i<d)              ! Always holes < particles
          IF (spin/spatial sym block of t is non-zero) THEN
            IF (spin/spatial sym block of v is non-zero) THEN
              MAKE x(l,b,c,i,j<=k) BLOCK BY DGEMM
              IF (b<=c and i<=j) ACCUM x(l,b<=c,i<=j<=k)
              IF (b<=c and j<=i,i<=k) ACCUM -x(l,b<=c,j<=i<=k)
              IF (b<=c and k<=i) ACCUM x(l,b<=c,k<=i<=j)
              IF (c<=b and i<=j) ACCUM -x(l,c<=b,i<=j<=k)
              IF (c<=b and j<=i,i<=k) ACCUM x(l,c<=b,j<=i<=k)
              IF (c<=b and k<=i) ACCUM -x(l,c<=b,k<=i<=j)
            END IF                     ! Note that b=c, i=j block is accumulated
          END IF                       ! multiple times
        END LOOP
      END LOOP
    END LOOP
Extensible many-electron theory module in NWChem
• “Extensible” because a new many-electron method can be added relatively easily by TCE
• Very general tensor storage interface (needs only the size and offsets of one-dimensional compressed tensor arrays; the offsets of intermediate arrays are also computed at run time by the programs generated by TCE)
• Compatible one- and two-electron integral transformation codes and offset generators
Optimizations
• Spin, spatial, permutation symmetries
• Dynamic tiling (orbital ranges are “tiled”, or blocked, into smaller sections so that the peak local memory usage does not exceed the user-specified limit)
• Dynamic load balancing parallelism (each tile-level tensor contraction is carried out on one processor with virtually no communication)
• Disk I/O is based on the Shared File Library of ParSoft, which allows one-sided (independent) read/write without Global Array cache
• Local sorting of array elements (so that the composite summation indices become contiguous in memory) followed by local DGEMM (with absolutely no communication in this critical step)
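The tiling idea can be sketched as follows. This is an illustration under assumptions, not the scheme NWChem actually uses: the function name `tile_orbitals` is hypothetical, orbitals are assumed to be grouped by spin and irrep, and each group is split as evenly as possible into tiles no larger than a given maximum. Applied to the four groups of the water/sto-3g sample output in this talk, it gives the same sizes and offsets as the "Suggested orbital blocking" table.

```python
def tile_orbitals(groups, max_tile):
    """Split (spin, irrep, count) groups into tiles of at most max_tile orbitals."""
    blocks = []
    offset = 0
    for spin, irrep, count in groups:
        if count == 0:
            continue
        n_tiles = -(-count // max_tile)       # ceiling division
        base, extra = divmod(count, n_tiles)  # spread the remainder evenly
        for t in range(n_tiles):
            size = base + (1 if t < extra else 0)
            blocks.append({"spin": spin, "irrep": irrep,
                           "size": size, "offset": offset})
            offset += size
    return blocks


# Groups taken from the sample output: 5 alpha + 5 beta holes, 2 alpha + 2 beta particles
groups = [("alpha", "a", 5), ("beta", "a", 5), ("alpha", "a", 2), ("beta", "a", 2)]
for block in tile_orbitals(groups, max_tile=50):
    print(block)

# A tighter limit forces more, smaller tiles
small = tile_orbitals(groups, max_tile=2)
print([b["size"] for b in small])
```

Smaller tiles lower the peak memory of each tile-level contraction at the price of more, shorter DGEMM calls, which is why the limit is left to the user.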
Previous & new algorithms
• Previous: DRA to GA by collective I/O (synchronization!) & GA cache, then GA to MA sort (communications!)
• New: SF to MA by one-sided I/O (no synchronization!), then MA to MA sort (no communications!)
Methods available
• Various spin-unrestricted coupled-cluster methods
• LCCD, CCD, LCCSD, CCSD, CCSDT
• More to follow (higher CC, CI, MBPT, EOM-CC, etc.)
• Input syntax
• Uses NWDFT module for the ground state
    dft
      xc Hfexch 1.0
    end
    tce
      ccsd
      thresh 1e-6
      maxiter 100
    end
    task tce energy
A sample output (water CCSD/sto-3g)
    NWChem General Electron-Correlation Theory Module
    -------------------------------------------------
    Programs generated by a Tensor Contraction Engine

    General Information
    -------------------
    Wavefunction type : Restricted
    No. of electrons  : 10
    Alpha electrons   : 5
    Beta electrons    : 5
    /\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\
    Correlation Information
    -----------------------
    Calculation type   : Coupled-cluster singles & doubles (CCSD)
    Max iterations     : 100
    Residual threshold : 0.10E-09

    Memory Information
    ------------------
    Available GA+MA space size is 26213624 doubles
    Maximum block size 50 doubles
A sample output (continued)
    Suggested orbital blocking
    Block   Spin    Irrep   Size        Offset
    -----------------------------------------
    1       alpha   a       5 doubles   0
    2       beta    a       5 doubles   5
    3       alpha   a       2 doubles   10
    4       beta    a       2 doubles   12
    /\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\
    2-e file size = 5443
    2-e file name = ./temp.v2
    Cpu time / sec 0.0
    /\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\
    t2 file size = 300
    t2 file name = ./temp.t2
    Cpu time / sec 0.0
    MBPT(2) correlation energy = -0.035867246917899 hartree
    MBPT(2) total energy = -74.998530309066552 hartree
    Cpu time / sec 0.0
A sample output (continued)
    -------------------------------------------------------
    Iter   Residuum            Correlation          Cpu/Sec
    -------------------------------------------------------
     1     0.089123237955088   -0.035867246917899   0.1
     2     0.031759620132034   -0.045406888265697   0.1
     3     0.012682891602275   -0.048387005902666   0.1
     4     0.005383277884425   -0.049437059764660   0.1
     5     0.002395445228466   -0.049839118488995   0.1
     6     0.001110827268269   -0.050002172402908   0.1
    /\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\
    26     0.000000002031284   -0.050127328255753   0.1
    27     0.000000001066715   -0.050127328323605   0.0
    28     0.000000000560286   -0.050127328359134   0.1
    29     0.000000000294338   -0.050127328377747   0.1
    30     0.000000000154649   -0.050127328387501   0.1
    31     0.000000000081266   -0.050127328392616   0.1
    -------------------------------------------------------
    CC iteration converged
    CCSD correlation energy = -0.050127328392616 hartree
    CCSD total energy = -75.012790390541269 hartree

    Task times cpu: 2.0s wall: 2.4s
Performance
H2O CCSD/cc-pVTZ, energy = -0.2850225 hartree (both codes); timings in secs/iter

    Code                                            Nodes   sym=off   sym=on
    Titan spin-adapted parallel CCSD code           1       16.8      16.6
    Titan spin-adapted parallel CCSD code           2       8.2       8.3
    Present spin-unrestricted parallel CCSD code    1       49.1      14.5
    Present spin-unrestricted parallel CCSD code    2       25.2      7.5

A spin-unrestricted code has to deal with 3 times as many t-amplitudes as a spin-adapted code does, so in theory the spin-adapted code should be 3 times as fast as the spin-unrestricted code.
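The ratios implied by these timings can be worked out directly. The snippet below only re-does the arithmetic on the eight numbers quoted on this slide; the dictionary layout and function names are illustrative choices.

```python
# Seconds per iteration, as quoted on the slide
timings = {
    ("spin-adapted", 1): {"off": 16.8, "on": 16.6},
    ("spin-adapted", 2): {"off": 8.2, "on": 8.3},
    ("spin-unrestricted", 1): {"off": 49.1, "on": 14.5},
    ("spin-unrestricted", 2): {"off": 25.2, "on": 7.5},
}


def symmetry_gain(code, nodes):
    """How much faster an iteration is with spatial symmetry switched on."""
    t = timings[(code, nodes)]
    return t["off"] / t["on"]


def parallel_speedup(code, sym):
    """Speedup on going from 1 node to 2 nodes."""
    return timings[(code, 1)][sym] / timings[(code, 2)][sym]


def unrestricted_over_adapted(nodes, sym):
    """Time ratio of the spin-unrestricted code to the spin-adapted code."""
    return timings[("spin-unrestricted", nodes)][sym] / timings[("spin-adapted", nodes)][sym]


print(round(unrestricted_over_adapted(1, "off"), 2))
print(round(unrestricted_over_adapted(1, "on"), 2))
print(round(symmetry_gain("spin-unrestricted", 1), 2))
print(round(parallel_speedup("spin-unrestricted", "on"), 2))
```

Without symmetry the spin-unrestricted code is about 2.9 times slower than the spin-adapted code on one node, close to the theoretical factor of 3. With symmetry on, the ratio falls below 1, because the spin-adapted timings barely change while the spin-unrestricted code gains a factor of about 3.4.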
Future plans
• CCSDTQ, CI, MBPT, EOM-CC implementation
• What is the appropriate tensor formulation for MBPT? (Are the MBPT denominators tensors?) See Head-Gordon et al.
• “Persistent intermediates” (the so-called similarity-transformed Hamiltonian matrix elements) in EOM-CC
• CC(2)PT(2) implementation
• Post-CCSD(T) O(n^7) method that includes perturbative quadruples
• Further optimization (loop fusion, more aggressive factorization, space-time tradeoffs, etc.) by computer scientist colleagues
• Modular extensibility of operator contraction engine
• Active spaces (multi-reference methods)
• Orbital rotations (atomic-orbital-based or local correlation methods)