
Scalable Stochastic Programming


Presentation Transcript


  1. Scalable Stochastic Programming Cosmin G. Petra Mathematics and Computer Science Division Argonne National Laboratory petra@mcs.anl.gov Joint work with Mihai Anitescu, Miles Lubin and Victor Zavala

  2. Motivation
  • Sources of uncertainty in complex energy systems: weather, consumer demand, market prices
  • Applications @ Argonne – Anitescu, Constantinescu, Zavala
  • Stochastic Unit Commitment with Wind Power Generation
  • Energy Management of Co-generation
  • Economic Optimization of a Building Energy System

  3. Stochastic Unit Commitment with Wind Power
  • Schedule the thermal units to minimize cost, satisfy demand, adopt wind power, keep a reserve, and respect technological constraints
  • Wind forecast – WRF (Weather Research and Forecasting) model
  • Real-time grid-nested 24h simulation
  • 30 samples require 1h on 500 CPUs (Jazz @ Argonne)
  [Figure: thermal generators and wind farms; slide courtesy of V. Zavala & E. Constantinescu]

  4. Optimization under Uncertainty
  • Two-stage stochastic programming with recourse (“here-and-now”), with continuous or discrete first-stage decisions (a standard formulation is sketched below)
  • Sample average approximation (SAA): sampling and statistical inference over M batches
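For reference, a standard way of writing the two-stage recourse model and its SAA; the notation here is generic and is not taken from the slide’s (image-based) equations:

\[
\min_{x}\; c^T x + \mathbb{E}_{\xi}\big[ Q(x,\xi) \big]
\quad \text{s.t.}\quad A x = b,\; x \ge 0,
\]
\[
Q(x,\xi) = \min_{y}\; q(\xi)^T y
\quad \text{s.t.}\quad W(\xi)\, y = h(\xi) - T(\xi)\, x,\; y \ge 0,
\]

and the SAA problem with S sampled scenarios \(\xi_1,\dots,\xi_S\):

\[
\min_{x}\; c^T x + \frac{1}{S}\sum_{i=1}^{S} Q(x,\xi_i)
\quad \text{s.t.}\quad A x = b,\; x \ge 0 .
\]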

  5. Solving the SAA problem – PIPS solver
  • Interior-point methods (IPMs)
  • Polynomial iteration complexity in theory (O(√n log(1/ε)) for path-following methods)
  • IPMs perform better in practice (infeasible primal-dual path-following)
  • No more than 30–50 iterations have been observed for n less than 10 million
  • We can confirm that this is still true for n a hundred times larger
  • Two linear systems are solved at each iteration
  • Direct solvers need to be used because the IPM linear systems are ill-conditioned and need to be solved accurately
  • We solve the SAA problems with a standard IPM (Mehrotra’s predictor-corrector) and specialized linear algebra: the PIPS solver

  6. Linear Algebra of Primal-Dual Interior-Point Methods
  • A convex quadratic problem leads to an IPM linear system at each iteration
  • Multi-stage SP gives a nested structure; two-stage SP gives an arrow-shaped linear system (modulo a permutation)
  • S is the number of scenarios
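Schematically, and in notation introduced here rather than taken from the slide, the two-stage system has the bordered block-diagonal (arrow) form

\[
H =
\begin{bmatrix}
K_1 &     &        &     & B_1 \\
    & K_2 &        &     & B_2 \\
    &     & \ddots &     & \vdots \\
    &     &        & K_S & B_S \\
B_1^T & B_2^T & \cdots & B_S^T & K_0
\end{bmatrix},
\]

where \(K_i\) is the (sparse, symmetric indefinite) KKT block of scenario \(i\), \(K_0\) is the first-stage block, and \(B_i\) couples scenario \(i\) to the first-stage variables.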

  7. The Direct Schur Complement Method (DSC)
  • Uses the arrow shape of H
  • 1. Implicit factorization
  • 2. Solving Hz = r: 2.1 backward substitution, 2.2 diagonal solve, 2.3 forward substitution (the block elimination is spelled out below)
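In the notation of the arrow-shaped system above (again mine, not the slide’s), the DSC steps amount to block elimination:

\[
\textbf{1. Implicit factorization: }\quad
K_i = L_i D_i L_i^T \ (i=1,\dots,S), \qquad
C = K_0 - \sum_{i=1}^{S} B_i^T K_i^{-1} B_i, \qquad
C = L_C D_C L_C^T .
\]
\[
\textbf{2. Solving } Hz=r: \quad
\text{2.1 }\ \tilde r_0 = r_0 - \sum_{i=1}^{S} B_i^T K_i^{-1} r_i, \qquad
\text{2.2 }\ C z_0 = \tilde r_0, \qquad
\text{2.3 }\ z_i = K_i^{-1}\big(r_i - B_i z_0\big).
\]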

  8. Parallelizing DSC – 1. Factorization phase
  • Scenario blocks are factorized in parallel: each process handles its own scenarios with sparse linear algebra (MA57)
  • The factorization of the 1st-stage Schur complement matrix uses dense linear algebra (LAPACK) and is the BOTTLENECK

  9. Parallelizing DSC – 2. Triangular solves
  • Scenario backsolves and forward substitutions run in parallel on each process with sparse linear algebra
  • The 1st-stage backsolve uses dense linear algebra and is the BOTTLENECK

  10. Implementation of DSC
  [Timeline diagram: each process factorizes and backsolves its local scenario blocks; the Schur complement contributions are summed with MPI_Allreduce and each process performs the dense factorization; in the triangular-solve phase, scenario backsolves are followed by an MPI_Allreduce, a dense solve, and the forward substitutions.]
  • Computations on the dense 1st-stage system are replicated on each process (a schematic code sketch of this pattern follows).
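A schematic mpi4py/NumPy sketch of this scenario-parallel pattern (illustration only: dense blocks and made-up names; PIPS itself uses sparse MA57 factorizations in C++ and reuses factorizations across solves):

import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD

def dsc_solve(K_local, B_local, K0, r_local, r0):
    """Solve the arrow system for this rank's scenarios.  K_local, B_local, r_local
    are lists of the rank's scenario blocks.  Returns (z0, z_local)."""
    # 1. Implicit factorization: local Schur complement / right-hand-side contributions
    SC_local = np.zeros_like(K0)
    rhs_local = np.zeros_like(r0)
    KinvB, Kinvr = [], []
    for K_i, B_i, r_i in zip(K_local, B_local, r_local):
        X = np.linalg.solve(K_i, np.column_stack([B_i, r_i]))   # K_i^{-1} [B_i  r_i]
        KinvB.append(X[:, :-1]); Kinvr.append(X[:, -1])
        SC_local += B_i.T @ X[:, :-1]
        rhs_local += B_i.T @ X[:, -1]
    # Sum the contributions of all ranks; every rank receives the full (replicated) result
    SC = np.empty_like(K0); rhs = np.empty_like(r0)
    comm.Allreduce(SC_local, SC, op=MPI.SUM)
    comm.Allreduce(rhs_local, rhs, op=MPI.SUM)
    # 2.2 Dense first-stage solve, replicated on each rank (the DSC bottleneck)
    z0 = np.linalg.solve(K0 - SC, r0 - rhs)
    # 2.3 Forward substitution: this rank's second-stage solutions
    z_local = [Kinvr[j] - KinvB[j] @ z0 for j in range(len(K_local))]
    return z0, z_local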

  11. Scalability of DSC
  • Unit commitment: 76.7% efficiency, but not always the case
  • Large number of 1st-stage variables: 38.6% efficiency on Fusion @ Argonne

  12. BOTTLENECK SOLUTION 1: STOCHASTIC PRECONDITIONER

  13. The Stochastic Preconditioner
  • The exact structure of C is a sum over the S scenarios (see the sketch below)
  • Take an IID subset of n scenarios
  • The stochastic preconditioner (P. & Anitescu, COAP 2011)
  • For C use the constraint preconditioner (Keller et al., 2000)
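A sketch of the idea in the notation used above; the precise scaling and splitting used in Petra & Anitescu (COAP 2011) may differ from this reconstruction. The Schur complement is a sum of IID scenario terms, so a scaled sum over a small IID subset \(\mathcal{N}\subset\{1,\dots,S\}\) with \(|\mathcal{N}| = n \ll S\) already approximates it well and can serve as the preconditioner:

\[
C = K_0 - \sum_{i=1}^{S} B_i^T K_i^{-1} B_i
\;\approx\;
M := K_0 - \frac{S}{n} \sum_{i \in \mathcal{N}} B_i^T K_i^{-1} B_i .
\]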

  14. Implementation of PSC
  [Timeline diagram: processes 1..p factorize and backsolve their scenario blocks; their Schur complement contributions are reduced with MPI_Reduce to process p+1 (which assembles and densely factorizes the preconditioner) and to process 1; process 1 runs the Krylov solve, with process p+1 applying the preconditioner triangular solves while processes 2..p are idle; the first-stage solution is broadcast with MPI_Bcast and all processes finish their forward substitutions.]
  • REMOVES the factorization bottleneck
  • Slightly larger solve bottleneck

  15. The “Ugly” Unit Commitment Problem
  • DSC on P processes vs. PSC on P+1 processes
  • Optimal use of PSC – linear scaling
  • 120 scenarios
  • The factorization of the preconditioner cannot be hidden anymore.

  16. Quality of the Stochastic Preconditioner
  • “Exponentially” better preconditioning (P. & Anitescu, 2011)
  • Proof: Hoeffding inequality
  • Assumptions on the problem’s random data: boundedness and uniform full rank of …; not restrictive
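For reference, the classical Hoeffding inequality behind this type of exponential bound (how it is applied to the sampled Schur complement is detailed in the paper; this is only the generic statement): for independent \(X_1,\dots,X_n\) with \(a_i \le X_i \le b_i\),

\[
\Pr\!\left( \Big| \sum_{i=1}^{n} X_i - \mathbb{E}\sum_{i=1}^{n} X_i \Big| \ge t \right)
\le 2 \exp\!\left( - \frac{2 t^2}{\sum_{i=1}^{n} (b_i - a_i)^2} \right).
\]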

  17. Quality of the Constraint Preconditioner
  • The preconditioned matrix has an eigenvalue 1 with order of multiplicity …
  • The rest of the eigenvalues satisfy …
  • Proof: based on Bergamaschi et al., 2004.

  18. The Krylov Methods Used for …
  • BiCGStab using the constraint preconditioner M (a small self-contained illustration follows)
  • Preconditioned Projected CG (PPCG) (Gould et al., 2001)
  • Preconditioned projection onto the …
  • Does not compute the basis for …; instead, …
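A self-contained SciPy illustration of the BiCGStab-with-constraint-preconditioner option (not PIPS code; the system, sizes, and preconditioner here are made up): the preconditioner keeps the constraint blocks exact and replaces the (1,1) block with its diagonal, in the spirit of Keller et al.

import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

# Illustrative saddle-point system  [[Q, A^T], [A, 0]] z = b  (hypothetical sizes)
n, m = 80, 20
rng = np.random.default_rng(0)
G = sp.random(n, n, density=0.05, random_state=rng)
Q = G @ G.T + sp.eye(n)                                 # symmetric positive definite block
A = sp.random(m, n, density=0.10, random_state=rng) + sp.eye(m, n)   # full row rank
Csys = sp.bmat([[Q, A.T], [A, None]], format="csc")
b = rng.standard_normal(n + m)

# Constraint preconditioner: exact constraint blocks, cheap approximation of Q
M_mat = sp.bmat([[sp.diags(Q.diagonal()), A.T], [A, None]], format="csc")
M_lu = spla.splu(M_mat)
M = spla.LinearOperator(Csys.shape, matvec=M_lu.solve)

z, info = spla.bicgstab(Csys, b, M=M)
print("BiCGStab converged:", info == 0)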

  19. Performance of the preconditioner
  • Eigenvalue clustering & Krylov iterations
  • Affected by the well-known ill-conditioning of IPMs.

  20. SOLUTION 2: PARALLELIZATION OF STAGE 1 LINEAR ALGEBRA

  21. Parallelizing the 1st-stage linear algebra
  • We distribute the 1st-stage Schur complement system.
  • C is treated as dense.
  • Alternative to PSC for problems with a large number of 1st-stage variables.
  • Removes the memory bottleneck of PSC and DSC.
  • We investigated ScaLAPACK and Elemental (successor of PLAPACK).
  • Neither has a solver for symmetric indefinite matrices (Bunch-Kaufman); LU or Cholesky only.
  • So we had to think of modifying either.
  • The first-stage system is a saddle-point matrix: a dense symmetric positive definite block coupled with a sparse full-rank block.

  22. Cholesky-based LDL^T-like factorization
  • Can be viewed as an “implicit” normal equations approach.
  • In-place implementation inside Elemental: no extra memory needed.
  • Idea: modify the Cholesky factorization by changing the sign after processing p columns (see the sketch after this slide).
  • It is much easier to do in Elemental, since it distributes elements, not blocks.
  • Twice as fast as LU.
  • Works for more general saddle-point linear systems, i.e., pos. semi-def. (2,2) block.
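A minimal dense NumPy sketch of that idea (my reconstruction for illustration, not the in-place Elemental code): a Cholesky-style loop whose diagonal sign flips from +1 to -1 after the first p columns, giving K = L diag(+1,…,+1,−1,…,−1) L^T.

import numpy as np

def signed_cholesky(K, p):
    """Factor K = L @ diag(s) @ L.T with s = (+1,...,+1,-1,...,-1), p plus signs.
    Works without pivoting for saddle-point K = [[Q, A.T], [A, -C]] with Q pos. def.,
    A full row rank, and C pos. semi-def.  (Sketch; no error handling.)"""
    n = K.shape[0]
    s = np.where(np.arange(n) < p, 1.0, -1.0)
    L = np.zeros_like(K, dtype=float)
    for j in range(n):
        v = K[j, j] - np.sum(s[:j] * L[j, :j] ** 2)
        L[j, j] = np.sqrt(s[j] * v)                 # s[j] * v > 0 by quasi-definiteness
        L[j+1:, j] = (K[j+1:, j] - L[j+1:, :j] @ (s[:j] * L[j, :j])) / (s[j] * L[j, j])
    return L, s

# Tiny check on a random saddle-point matrix (illustrative only)
rng = np.random.default_rng(0)
p, m = 6, 3
Q = rng.standard_normal((p, p)); Q = Q @ Q.T + p * np.eye(p)   # positive definite
A = rng.standard_normal((m, p))                                # full row rank (a.s.)
K = np.block([[Q, A.T], [A, np.zeros((m, m))]])
L, s = signed_cholesky(K, p)
print(np.allclose(L @ np.diag(s) @ L.T, K))                    # True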

  23. Distributing the 1st-stage Schur complement matrix
  • All processors contribute to all of the elements of the (1,1) dense block.
  • A large amount of inter-process communication occurs.
  • Each term is too big to fit in a node’s memory.
  • Possibly more costly than the factorization itself.
  • Solution: collective MPI_Reduce_scatter calls (a schematic sketch of this pattern follows)
  • Reduce (sum) terms, then partition and send to destination (scatter)
  • Need to reorder (pack) elements to match the matrix distribution
  • Columns of the Schur complement matrix are distributed as they are calculated
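A schematic mpi4py sketch of that communication pattern, with made-up sizes and a plain block-row target layout rather than Elemental’s element-wise distribution: each rank sums its scenarios’ contribution to the dense block, packs the rows in destination order, and a single MPI_Reduce_scatter both sums across ranks and leaves each rank holding only its own rows.

import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, nprocs = comm.Get_rank(), comm.Get_size()

n1 = 64                                         # first-stage dimension (illustrative)
rows = np.array_split(np.arange(n1), nprocs)    # block-row layout (stand-in for Elemental's)
counts = [len(r) * n1 for r in rows]            # number of elements each rank receives

# This rank's contribution sum_i B_i^T K_i^{-1} B_i over its scenarios (faked with random data)
local_contrib = np.random.default_rng(rank).standard_normal((n1, n1))

# Pack rows in destination order, then sum across ranks and scatter in one collective
sendbuf = np.concatenate([local_contrib[r, :].ravel() for r in rows])
recvbuf = np.empty(counts[rank])
comm.Reduce_scatter(sendbuf, recvbuf, recvcounts=counts, op=MPI.SUM)
my_rows = recvbuf.reshape(len(rows[rank]), n1)  # this rank's rows of the summed contribution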

  24. DSC with distributed first stage
  [Timeline diagram: for each b = 1:B, every process factorizes and backsolves its scenario blocks, and the b-th block of Schur complement columns is summed with MPI_Reduce_scatter into the Elemental distribution; the first-stage factorization and solve then run in Elemental, the reduced right-hand side is combined with MPI_Allreduce, and each process finishes its forward substitutions.]
  • The Schur complement matrix is computed and reduced block-wise (B blocks of columns).

  25. Reduce operations
  • Streamlined copying procedure – Lubin and Petra (2010)
  • Loop over contiguous memory and copy elements into the send buffer
  • Avoids the division and modulus ops needed to compute the positions
  • “Symmetric” reduce for the Schur complement matrix
  • Only the lower triangle is reduced
  • Fixed buffer size
  • A variable number of columns reduced
  • Effectively halves the communication (both data & # of MPI calls)

  26. Large-scale performance
  • First-stage linear algebra: ScaLAPACK (LU), Elemental (LU), and the Cholesky-based LDL^T-like factorization
  • Strong scaling of PIPS with > 4,000 scenarios on Fusion:
  • 90.1% from 64 to 1,024 cores
  • 75.4% from 64 to 2,048 cores
  • Lubin, P., Anitescu, in OMS 2011
  • SAA problem: 82,000 1st-stage variables; 189 million variables in total; 1,000 thermal units; 1,200 wind farms

  27. Towards real-life models – Economic dispatch with transmission constraints
  • Current status: ISOs (independent system operators) use
  • deterministic wind profiles, market prices and demand
  • network (transmission) constraints
  • outer simulation with a 1-h timestep over a 24-h horizon
  • inner corrections with a 5-min timestep over a 1-h horizon
  • Stochastic ED with transmission constraints (V. Zavala et al., 2010)
  • stochastic wind profiles & transmission constraints
  • deterministic market prices and demand
  • 24-h horizon with a 1-h timestep
  • Kirchhoff’s laws are part of the constraints
  • The problem is huge: KKT systems are 1.8 billion × 1.8 billion

  28. Solving ED with transmission constraints on Intrepid BG/P
  • 32k wind scenarios (k = 1024)
  • 32k nodes (131,072 cores) on Intrepid BG/P
  • Hybrid programming model: SMP inside MPI
  • Sparse 2nd-stage linear algebra: WSMP (IBM)
  • Dense 1st-stage linear algebra: Elemental with SMP BLAS + OpenMP for packing/unpacking the buffer
  • Very good strong scaling for a 4-h horizon problem
  • Lubin, P., Anitescu, Zavala – in proceedings of SC11

  29. Stochastic programming – a scalable computation pattern
  • Scenario parallelization in a hybrid MPI+SMP programming model
  • DSC, PSC (1st stage < 10,000 variables)
  • Hybrid MPI/SMP running on Blue Gene/P
  • 131k cores (96% strong scaling) for the Illinois ED problem with grid constraints: 2 billion variables, maybe the largest ever solved?
  • Close to real-time solutions (24-hr horizon in 1 hr wallclock)
  • Further development needed, since users aim for:
  • more uncertainty, more detail (×10)
  • faster dynamics → shorter decision window (×10)
  • longer horizons (California == 72 hours) (×3)

  30. Thank you for your attention! Questions?
