
Empirical investigations of local search on random KSAT for K = 3,4,5,6...






Presentation Transcript


  1. Empirical investigations of local search on random KSAT for K = 3,4,5,6... CDInfos0803 Program, Kavli Institute for Theoretical Physics, China. Erik Aurell, KTH Royal Institute of Technology, Stockholm, Sweden.

  2. Circumspect descent prevails in solving combinatorial optimization problems. Mikko Alava, John Ardelius, E.A., Petteri Kaski, Supriya Krishnamurthy, Pekka Orponen, Sakari Seitz, arXiv:0711.4902 (Nov 30, 2007). Earlier work by E.A., Scott Kirkpatrick and Uri Gordon (2004), Alava, Orponen and Seitz (2005), Ardelius and E.A. (2006), Ardelius, E.A. and Krishnamurthy (2007), and many others.

  3. Why did we get into this?

  4. Let me give three reasons.

  5. It is a fundamental and practically important problem... which I learnt about working for the Swedish railways. E.A., J. Ekman, Capacity of single rail yards [in Swedish], Swedish Railway Authority Technical reports (2002).

  6. They have potential, under-used applications in systems biology. As an example I will describe consulting work we did for Global Genomics, a now defunct Swedish biotech company. They claimed to have a new method to measure global gene expression. Many of their ideas were in fact from S. Brenner and K. Livak, PNAS 86 (1989), 8902-06, and K. Kato, Nucleic Acids Res. 23 (1995), 3685-3690.

  7. The problem is that, using only one Type IIS restriction enzyme, there is not enough information in the data to determine which genes were expressed (many genes could have given rise to a given peak). Kato (1995) tried using several enzymes of the same type sequentially. Problem: loss of accuracy, complicated. Global Genomics AB's invention was to use several enzymes in parallel.

  8. The Global Genomics invention led to an optimal matching problem. Matching the observations to a gene database gives a bipartite graph, where a link between a gene g and an observation o represents the fact that o could be an observation of g. The best matching can be represented as a subgraph of this graph, plus expression levels. A. Ameur, E.A., M. Carlsson, J. Orzechowski Westholm, "Global gene expression analysis by combinatorial optimization", In Silico Biology 4, 0020 (2004).

  9. Testing using the FANTOM database of mouse cDNA (RIKEN). For in silico testing we used the FANTOM database of full-length mouse cDNA, available at genome.gsc.riken.go.jp. We used an early 2003 version with 60 770 RIKEN full-length clones, partitioned into 33 409 groups representing different genes. This second list can be taken as a proxy for all genes in mouse. Principle of the in silico tests: 1. Select a fraction of genes. 2. Generate random expression levels. 3. Generate random peak and length perturbations. 4. Run the algorithm. 5. Compare.

  10. Both methods solve the optimization according to the given criteria when the perturbation parameters are small enough. The methods are comparable at a low or moderate fraction of genes expressed; local search is superior at a high fraction of genes expressed. Ameur et al. (2004).

  11. In theory, combinatorial optimization and constraint satisfiability give rise to many of the computationally hardest problems.

  12. In practice, combinatorial optimization and constraint satisfaction problems are routinely solved by complete methods (branch-and-bound), by local search heuristics, by mixed integer programming, etc.

  13. How is this possible? Following many others, we will look at a simple model.

  14. Random K-satisfiability problems. Let there be N Boolean variables, and hence 2N literals, and M logical propositions (clauses). A clause expresses that one out of the 2^k possible configurations of k variables is forbidden. Clauses are picked randomly (with replacement) from all possible k-tuples of variables. Can all M clauses be satisfied simultaneously?
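To make the model concrete, here is a minimal Python sketch of generating and checking such an instance. It is an illustration written for this transcript, not code from the talk; the names random_ksat, is_satisfied and unsat_clauses, and the signed-literal clause encoding, are my own conventions.

```python
import random

# A clause is a tuple of non-zero integers: +i means variable i must be True,
# -i means variable i must be False (DIMACS-style literals).

def random_ksat(n_vars, n_clauses, k, rng=random):
    """Draw n_clauses clauses uniformly at random, with replacement."""
    clauses = []
    for _ in range(n_clauses):
        variables = rng.sample(range(1, n_vars + 1), k)   # k distinct variables
        clause = tuple(v if rng.random() < 0.5 else -v for v in variables)
        clauses.append(clause)
    return clauses

def is_satisfied(clause, assignment):
    """assignment[i] is the Boolean value of variable i (1-indexed)."""
    return any(assignment[abs(lit)] == (lit > 0) for lit in clause)

def unsat_clauses(clauses, assignment):
    return [c for c in clauses if not is_satisfied(c, assignment)]

# Example: a 3SAT instance at clause-to-variable ratio alpha = 4.2
if __name__ == "__main__":
    n, alpha, k = 1000, 4.2, 3
    formula = random_ksat(n, int(alpha * n), k)
    assignment = [None] + [random.random() < 0.5 for _ in range(n)]
    print(len(unsat_clauses(formula, assignment)), "unsatisfied clauses")
```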

  15. KSAT is characterized by the number of clauses per variable, α = M/N. There is a phase transition between almost surely SAT and almost surely UNSAT. Algorithms take the longest time (on average) close to the phase boundary. Several simple algorithms take almost surely linear time for α small enough. Mitchell, Selman, Levesque (AAAI-92); Kirkpatrick, Selman, Science 264:1297 (1994).

  16. A now roughly decade-old statistical physics prediction for 3SAT and other constraint satisfaction problems: a clustering transition. [Phase diagram in α: in the SAT phase, first one state of solutions, then many states; in the UNSAT phase, no solutions; the 3SAT threshold values are marked.]

  17. The Mézard, Palassini and Rivoire 2005 prediction for 3COL, obtained by the entropic cavity method, computing within a 1RSB scenario the number of states with a given number of solutions. [Figure: first one green state; then many green states, but with most solutions in one or a few big states.]

  18. The latest clustering predictions for KSAT, K > 3, are in F. Krzakała, A. Montanari, F. Ricci-Tersenghi, G. Semerjian, L. Zdeborová, "Gibbs states and the set of solutions of random constraint satisfaction problems", PNAS 104(25):10318-23 (2007). [Figure: first a single cluster; then many clusters, with solutions found in a large set of clusters of about equal size; then many small clusters, but with most solutions in a few of them.]

  19. The cluster condensation transition in F. Krzakała et al. (2007). [Figure: before condensation there are many clusters, and solutions are found in a large set of clusters of about equal size; after condensation most clusters disappear, and again most solutions are found in a small number of them.]

  20. So does clustering in fact pose a problem to simple local search? Are the known features of the static landscape relevant to dynamics?

  21. A landscape that could be difficult for local search (courtesy Sui Huang). [Figure: an energy landscape with several local minima and a global minimum.]

  22. Papadimitriou invented a stochastic local search algorithm for SAT problems in 1991, today often referred to as RandomWalksat: pick an unsatisfied clause; pick a variable in that clause, flip it; loop. It is not quite like an equilibrium physics process in detailed balance, because only variables in unsatisfied clauses are updated. It solves 3SAT in linear time on average up to α about 2.7.
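A minimal Python sketch of this scheme, under the assumption that the formula is represented as in the earlier sketch (is_satisfied and the signed-literal encoding are my own illustrative conventions, not the original implementation):

```python
import random

def is_satisfied(clause, assignment):          # same helper as in the earlier sketch
    return any(assignment[abs(lit)] == (lit > 0) for lit in clause)

def random_walksat(clauses, n_vars, max_flips=10**6, rng=random):
    """Papadimitriou-style focused random walk, as described on the slide."""
    assignment = [None] + [rng.random() < 0.5 for _ in range(n_vars)]
    for _ in range(max_flips):
        unsat = [c for c in clauses if not is_satisfied(c, assignment)]
        if not unsat:
            return assignment                  # all clauses satisfied
        clause = rng.choice(unsat)             # pick an unsatisfied clause
        var = abs(rng.choice(clause))          # pick a variable in that clause
        assignment[var] = not assignment[var]  # flip it, loop
    return None                                # no solution found within the budget
```

The rescan of all clauses at every step is only for clarity; an efficient implementation would maintain the set of unsatisfied clauses incrementally.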

  23. A benchmark algorithm is Cohen-Kautz-Selman walksat, www.cs.washington.edu/homes/kautz/walksat. Pick an unsatisfied clause. Compute for each variable in the clause the breakclause: the number of other, presently satisfied, clauses that would be broken if the variable is flipped. If any variable has breakclause zero, flip it, loop. Otherwise, with probability p, flip the variable with the least breakclause, loop; else, with probability 1-p, flip a random variable in the clause, loop. It solves 3SAT in linear time on average up to α about 4.15, using default parameters from the public repository (Aurell, Gordon, Kirkpatrick (2004)).
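A sketch of the breakclause computation and the resulting choice rule, following the rule exactly as stated on this slide rather than the reference implementation at the URL; the helper names and clause encoding are again my own:

```python
import random

def is_satisfied(clause, assignment):          # same helper as in the earlier sketch
    return any(assignment[abs(lit)] == (lit > 0) for lit in clause)

def breakclause(var, clauses, assignment):
    """Number of presently satisfied clauses that break if var is flipped:
    exactly the clauses whose only satisfying literal belongs to var."""
    count = 0
    for clause in clauses:
        sat = [lit for lit in clause if assignment[abs(lit)] == (lit > 0)]
        if len(sat) == 1 and abs(sat[0]) == var:
            count += 1
    return count

def walksat_choose(clause, clauses, assignment, p, rng=random):
    """Choose which variable of an unsatisfied clause to flip (one walksat step)."""
    variables = [abs(lit) for lit in clause]
    breaks = {v: breakclause(v, clauses, assignment) for v in variables}
    zero = [v for v in variables if breaks[v] == 0]
    if zero:
        return rng.choice(zero)                    # a free flip: nothing breaks
    if rng.random() < p:
        return min(variables, key=breaks.get)      # greedy: least breakclause
    return rng.choice(variables)                   # random variable in the clause
```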

  24. We have worked with the Focused Metropolis Search (FMS) algorithm, and with ASAT, an alternative version. ASAT: if you have a solution, output it and stop. Pick an unsatisfied clause. Pick randomly a variable in the clause. If flipping that variable decreases the energy, do so; if not, flip the variable with probability p. Loop. This is also not in detailed balance (it also tries only unsatisfied clauses). The parameter p has to be optimized; the optimal value depends on the problem class, e.g. about 0.2 for 3SAT.
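A minimal Python sketch of ASAT as stated on this slide, where the energy is the number of unsatisfied clauses. The helper names and clause encoding are my own, and a real implementation would maintain the energy and the unsatisfied-clause list incrementally rather than recomputing them:

```python
import random

def is_satisfied(clause, assignment):          # same helper as in the earlier sketch
    return any(assignment[abs(lit)] == (lit > 0) for lit in clause)

def energy(clauses, assignment):
    """Number of unsatisfied clauses."""
    return sum(not is_satisfied(c, assignment) for c in clauses)

def asat(clauses, n_vars, p=0.2, max_flips=10**6, rng=random):
    assignment = [None] + [rng.random() < 0.5 for _ in range(n_vars)]
    e = energy(clauses, assignment)
    for _ in range(max_flips):
        if e == 0:
            return assignment                      # have a solution: output and stop
        unsat = [c for c in clauses if not is_satisfied(c, assignment)]
        clause = rng.choice(unsat)                 # pick an unsatisfied clause
        var = abs(rng.choice(clause))              # pick a random variable in it
        assignment[var] = not assignment[var]      # try the flip
        e_new = energy(clauses, assignment)
        if e_new < e or rng.random() < p:          # keep it if the energy decreases,
            e = e_new                              # otherwise keep it with probability p
        else:
            assignment[var] = not assignment[var]  # else undo the flip
    return None
```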

  25. We have a new algorithm, ChainSAT, which by design never goes up in energy.

Algorithm 1. ChainSAT
  S = random assignment of values to the variables
  chaining = FALSE
  while S is not a solution do
    if not chaining then
      C = a clause not satisfied by S, selected uniformly at random
      V = a variable in C, selected uniformly at random
    end if
    ΔE = change in the number of unsatisfied clauses if V is flipped in S
    if ΔE = 0 then
      flip V in S
    else if ΔE < 0 then
      with probability p1
        flip V in S
      end with
    end if
    chaining = FALSE
    if ΔE > 0 then
      with probability 1 − p2
        C = a clause that is satisfied only by V, selected uniformly at random
        X = a variable in C other than V, selected uniformly at random
        V = X
        chaining = TRUE
      end with
    end if
  end while
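A hedged Python transcription of Algorithm 1, for illustration only. The data structures and helper names are mine, the p1 and p2 defaults are placeholders rather than the tuned values from the paper, and the full rescans stand in for the incremental bookkeeping a real implementation would use:

```python
import random

def is_satisfied(clause, assignment):          # same helper as in the earlier sketch
    return any(assignment[abs(lit)] == (lit > 0) for lit in clause)

def energy(clauses, assignment):
    return sum(not is_satisfied(c, assignment) for c in clauses)

def clauses_satisfied_only_by(var, clauses, assignment):
    """Clauses whose single satisfying literal belongs to variable var."""
    out = []
    for clause in clauses:
        sat = [lit for lit in clause if assignment[abs(lit)] == (lit > 0)]
        if len(sat) == 1 and abs(sat[0]) == var:
            out.append(clause)
    return out

def chainsat(clauses, n_vars, p1=1e-4, p2=1e-4, max_steps=10**7, rng=random):
    S = [None] + [rng.random() < 0.5 for _ in range(n_vars)]
    chaining, V = False, None
    for _ in range(max_steps):
        e = energy(clauses, S)
        if e == 0:
            return S                               # S is a solution
        if not chaining:
            unsat = [c for c in clauses if not is_satisfied(c, S)]
            C = rng.choice(unsat)                  # random unsatisfied clause
            V = abs(rng.choice(C))                 # random variable in it
        S[V] = not S[V]                            # trial flip to measure ΔE
        dE = energy(clauses, S) - e
        if not (dE == 0 or (dE < 0 and rng.random() < p1)):
            S[V] = not S[V]                        # never accept an uphill move
        chaining = False
        if dE > 0 and rng.random() < 1.0 - p2:     # chain instead of climbing:
            blocked = clauses_satisfied_only_by(V, clauses, S)
            C = rng.choice(blocked)                # a clause satisfied only by V
            others = [abs(l) for l in C if abs(l) != V]
            V = rng.choice(others)                 # move on to a variable of C
            chaining = True
    return None
```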

  26. Solution course of a good local search (ASAT at α = 4.2).

  27. Runtimes for ASAT on 3SAT at α = 4.21. Ardelius and E.A. (2006).

  28. Runtimes for ASAT on 3SAT at α = 4.25. Ardelius and E.A. (2006).

  29. FMS on 4SAT at α = 9.6.

  30. ChainSAT on 4SAT, 5SAT, 6SAT.

  31. Do we know how local search fails on hard CSPs? The first guess would be that local search fails if solutions have little slackness, which is expressed by Parisi whitening.


  33. Several proposed clustering transitions do not stop circumspect descent. Not even an algorithm which would be trapped in a potential well of any depth. The reason why local search eventually fails is unknown.

  34. Clustering has been rigorously proven for KSAT with K greater than 8. For K less than 8 there are cavity method predictions. How do the numerics compare to these?

  35. Solve a 3SAT instance L times with a stochastic local search (ASAT). Compute the overlaps between these L solutions. See how that quantity changes with α: the average overlap and the variance of the overlap. Ardelius, E.A. and Krishnamurthy (2007).
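A sketch of this measurement, assuming solutions are stored as the 1-indexed Boolean lists used in the earlier sketches. The overlap here is taken as the fraction of variables on which two solutions agree, which is how I read the 80% threshold on the next slide; the helper names are mine:

```python
from itertools import combinations
from statistics import mean, pvariance

def overlap(s1, s2, n_vars):
    """Fraction of the n_vars variables on which two solutions agree."""
    return sum(s1[i] == s2[i] for i in range(1, n_vars + 1)) / n_vars

def overlap_statistics(solutions, n_vars):
    """Average and variance of the pairwise overlaps among L solutions,
    e.g. L independent runs of the asat() sketch above on one instance."""
    overlaps = [overlap(a, b, n_vars) for a, b in combinations(solutions, 2)]
    return mean(overlaps), pvariance(overlaps)
```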

  36. The rank-ordered plots of the overlaps in a chain of instances with an increasing number of clauses display a transition around 4.25. α ranges from 3.5 to 4.3; N is 2000. For α = 4.3, repeat until a solvable instance is found; for α ≤ 4.3, repeat until ASAT finds many solutions on the instance. Ardelius, E.A. and Krishnamurthy (2007).

  37. Generate many chains of instances and check for the α at which all solutions found have an overlap of at least 80%. N is 100, 200, 400, 1000, 2000; the number of chains at each N is 110. If a chain does not reach the 80% threshold, repeat. The threshold is between 4.25 and 4.27, and could in fact coincide with the SAT/UNSAT transition for 3SAT. This is not in contradiction with the theoretical predictions of Krzakała et al. (2007), who do not address 3SAT. Ardelius, E.A. and Krishnamurthy (2007).

  38. FMS diffusion on 4SAT, different α.

  39. FMS diffusion on 4SAT, α = 9.6.

  40. FMS diffusion on 4SAT, different N.

  41. As far as numerics can tell, if there are clusters beyond the clustering transitions in 4SAT, they are not separated by overlap.

  42. How does local search compare to more sophisticated (and specialized) methods that we will hear about at this school? (Here I have to go to PDF.)

  43. A question to the experts: which is (or are) the good metric(s) to compare runtimes? Wall-clock time? Some intrinsic count?

  44. Conclusions. Local heuristics (walksat, Focused Metropolis Search, Focused Record-to-Record Travel, ASAT, ChainSAT) are effective on hard random 3SAT, 4SAT, ... problems. This is true even if the heuristic by design can never get out of a potential well of any depth (ChainSAT); traps in the landscape do not stop these algorithms. There seems to be a "clustering condensation" transition in 3SAT very close to the SAT/UNSAT transition. If there is a clustering transition in 4SAT, these clusters do not seem to be separated in overlap (in contrast to K equal to 8 and greater).

  45. KTH/CSC. Thanks to John Ardelius, Supriya Krishnamurthy, Mikko Alava, Petteri Kaski, Pekka Orponen, Sakari Seitz.

  46. Is the search trapped in "potential wells" of metastable states? [Figure: energy as a function of time, and distance to target.] N is 1000, α is 4.2; ASAT in the linear regime, solution in 1000 sweeps.

  47. Is the search trapped in "potential wells" of metastable states? [Figure: energy as a function of time, and distance to target.] N is 1000, α is 4.3; ASAT in the nonlinear regime, no barrier seen.

  48. Is the search trapped in "potential wells" of metastable states? [Figure: energy as a function of time, and distance to target.] N is 1000, α is 4.1; ASAT in the linear regime, solution in 20 sweeps.
