Parallelization of Stochastic Metaheuristics to Achieve Linear Speed-ups while Maintaining Quality

Parallelization of Stochastic Metaheuristics to Achieve Linear Speed-ups while Maintaining Quality Course Project Presentation: Mustafa Imran Ali Ali Mustafa Zaidi

Outline of Presentation • Brief Introduction • Motivation for this problem • Simulated Annealing (SA) • Simulated Evolution (SimE) • Related Work • Classification of Parallel Strategies • Previous efforts • Our Efforts and Results • Low-Level Parallelization of Both Heuristics • Domain-Decomposition Parallelizations of SimE • Multithreaded-Search implementations of SA and SimE • Conclusions Drawn, Future Work

Motivation • Stochastic Heuristics are utilized to solve a wide variety of combinatorial optimization problems • VLSI CAD, Operations Research, Network Design etc. • These heuristics attempt to find near-optimal solutions by performing an ‘intelligent search’ of the ‘solution space’ • Each algorithm has a built in ‘intelligence’ that allows it to move towards successively better solutions. • But they usually require large execution times to achieve near optimal solutions.

Motivation • Parallelization of Heuristics aims to achieve one of two basic goals: • Achieve Speedup • Achieve Better Quality Solutions • Parallelization of Heuristics different from Data and Functional parallelism: • Different strategies alter the properties of basic algorithm in different ways • Thus, Quality of solution achievable is sensitive to parallelization strategy • Key is to select/develop parallelization strategy that has maximum positive impact on heuristic • With respect to the goals we are trying to achieve: • For better quality solutions, strategy should enhance algorithmic “intelligence”. • For better runtimes, strategy should focus on reducing work-done per processor.

Our Objectives • Our goal is to explore, implement and develop parallelization strategies that allow us to: • Achieve near-linear speedup (or best effort) • Without sacrificing quality (fixed constraint) • Speedup trends should be as scalable as possible

Approaches taken and Considerations • In order to effectively consider this problem, we must look at it in terms of several aspects: • The nature of the Heuristics themselves • The nature of the Parallel Environment • The nature of the Problem Instance and Cost Functions • All three factors influence both the • Runtime, and • Achievable Solution Quality of any parallelization strategy. • For our task, we must optimize all three factors.

Introduction to Simulated Annealing • Basic Simulated Annealing Algorithm • With Metropolis Loop

Introduction to Simulated Evolution • Basic Simulated Evolution Algorithm

Related Work • Classification of Parallelization Strategies for Metaheuristics [ref] • Low-Level Parallelization • Domain Decomposition • Parallel or Multithreaded Search • Previous work done for Parallel SA • Extensive – general as well as problem specific • All three approaches tried, • Type 3 most promising – still active area of research • Previous work done for Parallel SimE • Minimal – Type 2 parallelization strategy proposed, both by designers of SimE

Our Efforts • Starting Point for our work this semester: • Basic version of Type 3 Parallel SA • Basic version of Type 2 Parallel SimE • Parallelization of SA: • Several Enhanced versions of Type 3 • Implementation of Type 1 • Type 2 not implemented because… • Parallelization of SimE • Several Enhancements to Basic Type 2 • Implementation of Type 1 • Implementation of Type 3

Basic Type 3 Parallel SA • Basic Type 3 Parallel SA: • Based on Asynchronous Multiple Markov Scheme developed in [ref] • Best Type 3 scheme developed for SA to date. • Primarily intended for improving solution qualities achievable over serial version

Basic Type 3 Parallel SA Algorithm

Type 3 Parallel SA – Second Version • Strategy 2 - Speed-up Oriented Type 3 Parallel SA • From the above starting point, we saw that high-quality characteristics of basic Type 3 may be exploited to produce a speed-up oriented version • Expected to be Capable of achieving quality equivalent to serial version, but not better • While providing near-linear runtimes • Near-linear runtimes forced by dividing workload on each processor by number of processors • M/p metropolis iterations instead of M.

Speed-up Oriented Type 3 Parallel SA • Speedup oriented Type 3 Parallel SA • Results consistently show a 10% drop in achievable solution quality from the Serial version • Runtimes show near-linear speedups, as expected

Lessons Learned from Strategy 2 • We reasoned that • The 10% quality drop occurs due to negative impact on the ‘intelligence’ of the parallel SA. • To restore achievable quality, we must counteract this effect. • Intelligence of SA: • Lies in the “Cooling Schedule” [ref Dr. Sait’s book] • Division of workload directly tampers with the Cooling Schedule. • Proposed Solution: • Attempt to optimize the Cooling Schedule to take into account the parallel environment – a “Multi-Dimensional” Cooling Schedule. • To maintain Speedup, M/p remained unchanged, while other parameters varied across processors

Third Version of Type 3 Parallel SA • Strategy 3 – Varying Parameter settings across different processors • Expected that this would result in more effective search of the solution space. • Several sets of parameter settings were tried • Primarily by Varying  and T across processors • Processors with higher T and lower  would perform more random search • Processors with lower T and higher  would be more greedy • Intermittent sharing of information should diversify search from same position. • Results obtained from these versions: • NO IMPROVEMENT OF QUALITY OVER STRATEGY 2 • Show results (hadeed_1)

Lessons Learned from Strategy 3 • Based on the last two attempts, we reasoned that: • M/p drastically reduces time spent searching for better solutions at a given temperature • Lower temperatures achieved quicker, • Adverse effect on hill-climbing property of Heuristic. • Simple division of M/p inadequate for sustaining quality. • What to do next? • Develop techniques that minimize adverse effect on algorithmic intelligence. • Two different tracks possible • Type 1 parallel SA • Further Study Runtime vs Quality trends of Type 3 Parallel SA

What to do next? • Type 1 Parallel SA • Pros: • Leaves algorithmic intelligence intact • Solution quality guaranteed. • Cons: • High communication frequency adversely affects runtime • Environmental factors come into play. • Further explore dynamics of Type 3 • Pros: • Low communication frequency • Suitable for cluster environment (show chart) • Cons: • Uncharted territory - progress not assured

Type 1 Parallel SA • Low-level parallelization: • Divide Cost Computation Function • Our cost function has 3 parts: • Wirelength, Power, Delay • First two are easy to divide • Division of Delay computation posed a significant challenge • Too many replicated computations across processors negated benefits of division. • Eventually had to be excluded

Results of Type 1 Parallel SA • Performance of Type 1 Parallel SA: • Abysmal!! • Reasons: • 2 collective communications per iteration • Amount of work divided is small compared to communication time • Communication delay increases with increasing processors • (Show Chart) • Found to be completely unsuitable for • MIMD-DM environment, and • Our problem instance.

Further Exploring Type 3 Parallel SA • To improve our achievable quality of our Type 3 parallel SA: • In depth study of the impact of parameter M on achievable solution quality. • All experiments first attempted for the Serial Version, then replicated to the parallel version • Based on what we know of Type 3 Parallelization schemes, see how any new lessons can be incorporated into it.

Impact of ‘M’ on Solution Quality (Serial)

Impact of ‘M’ on Solution Quality (Parallel 7)

Observations • We see that initially, fastest improvement in quality appears with smallest M • However, Quality saturates earlier with smaller M. • Thus it might be beneficial to increase M as time progresses. • But by mow much? How to minimize runtime.

Learning From Experiments • Another observation helps: • Until saturation, rate of improvement nearly constant (per metropolis calls) • This applicable for all runs for a given circuit. • Thus best way to minimize time while sustaining quality improvement: • Set value of M adaptively such that the average rate of improvement remains constant. • New Enhancement to Serial SA. • Since Type 3 parallel SA improves faster than serial version, it is expected that some speedup will be observed • Experiments still under way – parameter tuning being done

Preliminary results • For 7 processors in parallel

Preliminary Results • Results are similar to the original implementation • Enhancement to serial SA mitigates the observed benefits of the parallel version • Although further parameter tuning/code refinements may improve parallel results even more.

Conclusions Drawn from Experiments • Type 3 parallelization of SA is most robust • Minimum susceptibility to environment and problem instance • Type 1 fails due to unsuitability to environment and to problem instance • For Parallel SA, direct trade off exists between achievable solution quality and speedup • Depending on quality desired, linear speedup possible. • For highest quality, speed-up diminishes to 1 in most cases. • Further experimentation needed to verify these points.

Parallel Simulated Evolution Strategies

Base Work for SimE • Type II (Domain Decomposition) implementation • Problem: Poor Runtimes with Quality Degradation • Improvement Proposed: Use of Random Row Allocation over the Fixed Row Allocation used in previous work • Results: Improvement of Quality over Fixed Row Allocation but still short of Serial Quality

Type II Quality Issues • Observation: Parallel Quality will always lag behind serial quality • Reason: Division of Solution into Domains restricts optimum cell movement (worsens with more processors/partitions) • Focus on improving runtime!

Type II Runtime Improvements • How? • Reduce Communication time • Reduce Frequency of Communications • Reduce Amount of Data Communicated • Overlap Communication with Computation • Reduce Computations • Can Workload be still better divided? • How will workload division affect communication requirements?

Type II SimE Algorithm Structure Communication Cost Computations

Type II Communication Optimization • No potential for Computation Communication Overlap • Implicit Barrier synchronization at communication points • Possibility of Reducing Communication Frequency Over Multiple Iterations • Do multiple operations (E,S,A) on assigned partition before communicating • Impact on Solution Quality due to accumulated errors • Actual Impact on Solution Quality vs. runtime improvement not presently quantified (future work)

Type II Communication Optimization(2) • Reducing Communication Frequency • by combining gather & broadcast operation into 1 MPI call • Efficiency of collective call not too superior in MPI implementation used • Reduce Data Communicated per call • Significant impact on runtime due to barrier synchronization • Essentially compressing placement data! • Tradeoff: Added computations in data preprocessing and post processing step

Type II Computation Optimization • Can workload be better divided? • Evaluation, Selection and Allocation already localized (divided) • What about cost computation? • Goodness Evaluation needs costs computed • Dependencies across partitions • Delay computations over long paths • Spans over partitions • Wire length & Power Computation • Potential for cost computation division

Wirelength & Power Cost Division • More independent of other partitions than delay computations • Effect of computation division can be readily evaluated for wire length and power (within the limited time constraints) • Tradeoff: Added Computation • Partial Costs Need to be communicated to Master • Additional communication phase added per iteration! • Actual benefit over non-division: Results will tell!

Results & Observations • Effect of communication worsens speed-ups with increasing processors • communication time over shadows gain of computation division • Optimizing communication resulted in greater gains than cost computation division • Dependencies do not allow much gains

Targeting Better Solution Qualities • Type II parallelization has poor quality issues • How to maintain quality? • Type I Parallelization • Same Quality as Serial Implementation guaranteed • Speed-ups governed by gains resulting from division of cost-computation • Type III Parallelization • Can we benefit from parallel co-operating searches?

Type I Parallelization • Target as much computation division without dividing SimE algorithm’s core “intelligence” • Computations that don’t affect intelligence • Cost computations • Goodness evaluations • Computations that affect intelligence • Selection function • Allocation function

Type I Parallelization • Again, wire length & power easier to divide than delay (same reasoning as for Type II) • Workload division achieved by dividing the cells among PE • Each PE computes costs and goodness for a fraction of cells • Division of cells done at beginning & fixed • Slave PEs communicate goodness values to master PE which does Selection & Allocation

Type I Parallelization Results • Runtimes not drastically reduced • Allocation takes significant portion of overall computation time • Speed-ups again limited by communication • Not by data communicated but due to more overheads with increased participants

Type III Parallelization • Parallel Cooperating searches communicating best solutions through a master • Modeled after Parallel SA • A Metropolis loop can be equated with a SimE compound move • Intelligence of SimE can benefit from best solution produced among all • May lead to more rapid convergence to better quality solutions

Type III Parallelization Results • Results quite contrary to expectations • Highest quality inferior than serial algorithm • No speed-ups observed • Parallel Searches used as is doesn’t benefit • possible explanation: greedy behavior of parallel searches is counter productive to SimE intelligence (inability to escape local minima)

Improving Type III Parallelization • Proposed schemes for improved SimE parallel searches • varying frequency of communication with time • Initially exchanges can be frequent and with frequency decreasing to allow more diversification • Intelligently combining good elements in each solution to get a new starting solution • Best location for each cell (or a cluster) can be identified by examining/comparing goodness values among solutions received from peers and constructing a good solution to improve further

Parallelization of Stochastic Metaheuristics to Achieve Linear Speed-ups while Maintaining Quality

Parallelization of Stochastic Metaheuristics to Achieve Linear Speed-ups while Maintaining Quality

Presentation Transcript

Parallelization of a Non-Linear Analysis Code

Maintaining Your Sanity While Maintaining Your Fixed Assets

Taxonomy of Hybrid Metaheuristics

Scenario Trees and Metaheuristics for Stochastic Inventory Routing Problems

Metaheuristics

Angular and Linear Speed

Metaheuristics

Linear and angular speed

6.1.4 – Angular Speed, Linear Speed

Maintaining STAR Quality

The Need for Speed: Parallelization in GEM

How To Achieve Quality Sleep

Striving to Achieve… Quality

Maintaining Your Sanity While Maintaining Your Fixed Assets

Linear Motion – Average Speed

Scaling Up While Maintaining Quality: Life After the Like

How to achieve higher redundancy of the UPS for QPS ?

How to achieve higher redundancy of the UPS for QPS ?

Metaheuristics

Geant4 Geometry Speed-ups

Parallelization of Stochastic Evolution for Cell Placement