
Learning to Improve the Quality of Plans Produced by Partial-order Planners


Presentation Transcript


  1. Learning to Improve the Quality of Plans Produced by Partial-order Planners M. Afzal Upal Intelligent Agents & Multiagent Systems Lab

  2. Outline • Artificial Intelligence Planning: Problems and Solutions • Why Learn to Improve Plan Quality? • The Performance Improving Partial-order planner (PIP) • Intra-solution Learning (ISL) algorithm • Search-control vs Rewrite rules • Empirical Evaluation • Conclusion

  3. The Performance Task: Classical AI Planning • Given: an initial state and a set of goals (shown on the slide as two 8-puzzle board configurations) and the actions {up, down, left, right} • Find: a sequence of actions that achieves the goals when executed in the initial state, e.g., down(4), right(3), up(2)

  4. Automated Planning Systems • Domain-independent Planning Systems • Modular, sound, and complete • Domain-dependent Planning Systems • Practical, efficient, produce high-quality plans

  5. Domain-independent Systems • State-space Search (each search node is a valid world state) • e.g., PRODIGY, FF • Partial-order Plan-space Search (each search node is a partially ordered plan) • Partial-order planners, e.g., SNLP, UCPOP • Graphplan-based Search (a search node is a union of world states), e.g., STAN • Compilation to General Search • satisfiability engines, e.g., SATPLAN • constraint satisfaction engines, e.g., CPLAN

  6. State-space vs Plan-space Planning [Figure: an 8-puzzle example contrasting state-space search, where actions such as right(8), down(2), left(4), and up(6) transform complete board states, with plan-space search over partially ordered actions]

  7. Partial-order Plan-space Planning • Partial-order planning is the process of removing flaws (open conditions, i.e., unresolved goals, and threats, i.e., unordered actions that cannot safely take place at the same time)

  8. Partial-order Plan-space Planning • Decouple the order in which actions are added during planning from the order in which they appear in the final plan
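Slides 7 and 8 describe partial-order planning as iterative flaw removal. A minimal sketch of that loop follows, with illustrative function names (real partial-order planners such as SNLP and UCPOP also maintain causal links and orderings and search with backtracking):

```python
# Minimal sketch of a partial-order planning loop: repeatedly pick a flaw
# (an open condition or a threat) and resolve it. All names are illustrative.
def pop(initial_plan, select_flaw, resolvers):
    plan = initial_plan
    while True:
        flaw = select_flaw(plan)            # an open condition or threat, if any remain
        if flaw is None:
            return plan                     # flawless partial order: a complete plan
        choices = resolvers(plan, flaw)     # add a step, reuse a step, or add an ordering
        if not choices:
            return None                     # dead end; a real planner backtracks here
        plan = choices[0]                   # a real planner searches over all choices
```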

  9. Learning to Improve Plan Quality for Partial-order Planners • How to represent plan quality information? • Extended STRIPS operators + value function • How to identify learning opportunities? (there are no planning failures or successes to learn from) • Assume a better-quality model plan for a given problem is available (from a domain expert or through a more extensive automated search of the problem’s search space) • What search features to base the quality-improving search-control knowledge on?

  10. The Logistics Transportation Domain Initial State: {at-object(parcel, postoffice), at-truck(truck1, postoffice), at-plane(plane1, airport)} Goals: {at-object(parcel, airport)}

  11. STRIPS encoding of the Logistics Transportation Domain LOAD-TRUCK(Object, Truck, Location) Preconditions: {at-object(Object,Location), at-truck(Truck,Location)} Effects: {in(Object,Truck), not(at-object(Object,Location))} UNLOAD-TRUCK(Object, Truck, Location) Preconditions: {in(Object,Truck), at-truck(Truck,Location)} Effects: {at-object(Object,Location), not(in(Object,Truck))} DRIVE-TRUCK(Truck, From, To) Preconditions: {at-truck(Truck,From), same-city(From,To)} Effects: {at-truck(Truck,To), not(at-truck(Truck,From))}
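These operators could be written down as plain data, for instance as follows (an illustrative Python encoding, not PIP's actual input syntax; upper-case names stand for variables as on the slide):

```python
# Illustrative encoding of the DRIVE-TRUCK operator from the slide as plain data.
drive_truck = {
    "name": "DRIVE-TRUCK(Truck, From, To)",
    "preconditions": ["at-truck(Truck, From)", "same-city(From, To)"],
    "effects": ["at-truck(Truck, To)", "not(at-truck(Truck, From))"],
}
```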

  12. PR-STRIPS (similar to PDDL 2.1 level 2) • A state is described using propositional as well as metric attributes (which specify the levels of the resources in that state). • An action can have propositional as well as metric effects (functions which specify the amount of resources the action consumes). • A value function specifies the relative importance of each resource and defines plan quality as a function of the amount of resources consumed by all actions in the plan.

  13. PR-STRIPS encoding of the Logistics Transportation Domain LOAD-TRUCK(Object, Truck, Location) Preconditions: {at-object(Object,Location), at-truck(Truck,Location)} Effects: {in(Object,Truck), not(at-object(Object,Location)), time(-0.5), money(-5)} UNLOAD-TRUCK(Object, Truck, Location) Preconditions: {in(Object,Truck), at-truck(Truck,Location)} Effects: {at-object(Object,Location), not(in(Object,Truck)), time(-0.5), money(-5)} DRIVE-TRUCK(Truck, From, To) Preconditions: {at-truck(Truck,From)} Effects: {at-truck(Truck,To), not(at-truck(Truck,From)), time(-0.02*distance(From,To)), money(-distance(From,To))}

  14. PR-STRIPS encoding of the Logistics Transportation Domain LOAD-PLANE(Object, Plane, Location) Preconditions: {at-object(Object, Location), at-plane(Plane, Location)} Effects: {in(Object, Plane), not(at-object(Object, Location)), time(-0.5), money(-5)} UNLOAD-PLANE(Object, Plane, Location) Preconditions: {in(Object, Plane), at-plane(Plane, Location)} Effects: {at-object(Object, Location), not(in(Object, Plane)), time(-0.5), money(-5)} FLY-PLANE(Plane, From, To) Preconditions: {at-plane(Plane, From), airport(To)} Effects: {at-plane(Plane, To), not(at-plane(Plane, From)), time(-0.02*distance(From, To)), money(-distance(From, To))}

  15. PR-STRIPS encoding of the Logistics Transportation Domain Quality(Plan) = 1/ (2*time-used(Plan) + 5*money-used(Plan))
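As a sketch in code, the value function on this slide could be computed as follows, assuming time-used and money-used are the totals of the (positive) amounts consumed by the plan's actions:

```python
# Sketch of the PR-STRIPS value function:
#   Quality(Plan) = 1 / (2*time-used(Plan) + 5*money-used(Plan))
def plan_quality(actions):
    time_used = sum(a["time"] for a in actions)    # total time consumed by the plan
    money_used = sum(a["money"] for a in actions)  # total money consumed by the plan
    return 1.0 / (2.0 * time_used + 5.0 * money_used)

# Example: the truck plan for one parcel over a distance of 250 units
truck_plan = [
    {"time": 0.5, "money": 5},            # load-truck
    {"time": 0.02 * 250, "money": 250},   # drive-truck
    {"time": 0.5, "money": 5},            # unload-truck
]
print(plan_quality(truck_plan))           # roughly 0.00076: a low-quality plan
```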

  16. The Learning Problem • Given • A planning problem (goals, initial state, and initial resource level) • Domain knowledge (actions, plan quality knowledge) • A partial-order planner • A model plan for the given problem • Find • Domain-specific rules that can be used by the given planner to produce better quality plans (than the plans it would have produced had it not learned those rules).

  17. Solution: The Intra-solution Learning Algorithm • Find a learning opportunity • Choose the relevant information and ignore the rest • Generalize the relevant information using a generalization theory

  18. Phase 1: Find a Learning Opportunity 1. Generate the system’s default plan and default planning trace by running the given partial-order planner on the given problem 2. Compare the default plan with the model plan; if the model plan is not of higher quality, go to Step 1 3. Infer the planning decisions that produced the model plan 4. Compare the inferred model planning trace with the default planning trace to identify the decision points where the two traces differ; these are the conflicting choice points
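Step 4 of this phase can be pictured with the following sketch, which assumes each trace is simply an ordered list of planning decisions (PIP's traces are ordered constraint sets, as slide 23 explains):

```python
# Sketch of Phase 1, Step 4: walk the two traces in order and report the first
# decision point where they diverge (a conflicting choice point).
def conflicting_choice_point(default_trace, model_trace):
    common = []
    for d_default, d_model in zip(default_trace, model_trace):
        if d_default == d_model:
            common.append(d_default)             # shared prefix of the two traces
        else:
            return common, (d_default, d_model)  # learning opportunity found
    return common, None                          # no conflict on the shared prefix
```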

  19. [Diagram: the model trace and the system’s planning trace drawn side by side, with their common nodes highlighted]

  20. Phase 2: Choose the Relevant Information • Examine the downstream planning traces, identifying relevant planning decisions using the following heuristics: • A planning decision to add an action Q is relevant if Q supplies a relevant condition to a relevant action • A planning decision to establish an open condition is relevant if it binds an uninstantiated variable of a relevant open condition • A planning decision to resolve a threat is relevant if all three actions involved are relevant
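A rough sketch of how the first of these heuristics might be applied when sweeping the downstream decisions (the decision records and their fields are illustrative, not PIP's internal representation):

```python
# Sketch of Phase 2: keep a downstream add-action decision if it supplies a
# condition to an action that is already known to be relevant.
def choose_relevant(decisions, seed_actions):
    relevant_actions = set(seed_actions)       # e.g. {"END", "unload-plane(o1,p1,sjc)"}
    relevant = []
    for d in decisions:                        # downstream decisions, in trace order
        if d["type"] == "add-action" and d["supplies-to"] in relevant_actions:
            relevant_actions.add(d["action"])  # the newly added action is now relevant too
            relevant.append(d)
    return relevant
```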

  21. Phase 3: Generalize the Relevant Information • Generalize the relevant information using a generalization theory • Replace all constants with variables
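The constant-to-variable step can be sketched naively as below (a toy generalization over decisions written as strings; PIP's generalization theory may constrain which constants are variablized):

```python
# Sketch of Phase 3: replace every known constant with a variable of the same name.
def variablize(decisions, constants):
    generalized = []
    for decision in decisions:                  # e.g. "load-truck(o1,tr1,lax)"
        for c in constants:
            decision = decision.replace(c, "?" + c.upper())  # constant -> variable
        generalized.append(decision)
    return generalized

print(variablize(["load-truck(o1,tr1,lax)", "drive-truck(tr1,lax,sjc)"],
                 ["o1", "tr1", "lax", "sjc"]))
# ['load-truck(?O1,?TR1,?LAX)', 'drive-truck(?TR1,?LAX,?SJC)']
```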

  22. An Example Logistics Problem Initial-state: {at-object(o1, lax), at-object(o2, lax), at-truck(tr1, lax), at-plane(p1, lax), airport(sjc), distance(lax, sjc)=250, time=0, money=500} Goals: {at-object(o1, sjc), at-object(o2, sjc)}

  23. Generate System’s Default Plan and Default Planning Trace • Use the given planner to generate the system’s default planning trace (an ordered constraint set) • Each add-step/establishment decision adds a causal link and an ordering constraint • Each threat-resolution decision adds an ordering constraint 1- START ‹ END; 2- unload-truck() ‹ END, unload-truck(o1,Tr,sjc) --at-object(o1,sjc)--> END; 3- load-truck() ‹ unload-truck(), load-truck(o1,Tr,sjc) --in-truck(o1,Tr)--> unload-truck(o1,Tr,sjc); 4- drive-truck() ‹ unload-truck(), drive-truck(Tr,X,sjc) --at-truck(Tr,sjc)--> unload-truck(o1,Tr,sjc); 5- …

  24. Compare System’s Default Plan with the Model Plan • System’s Default Plan: load-truck(o1, tr1, lax), load-truck(o2, tr1, lax), drive-truck(tr1, lax, sjc), unload-truck(o1, tr1, sjc), unload-truck(o2, tr1, sjc) • Model Plan: load-plane(o1, p1, lax), load-plane(o2, p1, lax), fly-plane(p1, lax, sjc), unload-plane(o1, p1, sjc), unload-plane(o2, p1, sjc)

  25. Infer the Unordered Model Constraint Set unload-plane(o1,p1,sjc) --at-object(o1,sjc)--> END; load-plane(o1,p1,lax) --in(o1,p1)--> unload-plane(o1,p1,sjc); fly-plane(p1,lax,sjc) --at-plane(p1,sjc)--> unload-plane(o1,p1,sjc); START --at-plane(p1,lax)--> load-plane(o1,p1,lax); START --at-plane(p1,lax)--> fly-plane(p1,lax,sjc); START --at-object(o1,lax)--> load-plane(o1,p1,lax); unload-plane(o2,p1,sjc) --at-object(o2,sjc)--> END; load-plane(o2,p1,lax) --in(o2,p1)--> unload-plane(o2,p1,sjc); fly-plane(p1,lax,sjc) --at-plane(p1,sjc)--> unload-plane(o2,p1,sjc); START --at-plane(p1,lax)--> load-plane(o2,p1,lax); START --at-plane(p1,lax)--> fly-plane(p1,lax,sjc); START --at-object(o2,lax)--> load-plane(o2,p1,lax)

  26. Compare the Two Planning Traces to Identify Learning Opportunities • Common prefix: START ‹ END, with open condition at-object(o1,sjc) at END (a learning opportunity) • Default trace: START ‹ END, unload-truck(o1,tr1,sjc) ‹ END, unload-truck(o1,tr1,sjc) --at-object(o1,sjc)--> END • Model trace: START ‹ END, unload-plane(o1,p1,sjc) ‹ END, unload-plane(o1,p1,sjc) --at-object(o1,sjc)--> END

  27. Choose the Relevant Planning Decisions • Learning opportunity: add-action:START-END • Relevant decisions: add-action:unload-plane(o1) vs add-action:unload-truck(o1); add-action:fly-plane() vs add-action:drive-truck(); add-action:load-plane(o1) vs add-action:load-truck(o1) • Irrelevant decisions: add-action:unload-plane(o2), add-action:drive-truck(), add-action:load-plane(o2), add-action:load-truck(o2)

  28. Generalize the Relevant Planning Decision Chains add-action:START-END; add-action:unload-plane(O, P) vs add-action:unload-truck(O, T); add-action:fly-plane(P, X, Y) vs add-action:drive-truck(T, X, Y); add-action:load-plane(O, P) vs add-action:load-truck(O, T)

  29. In What Form Should the Learned Knowledge Be Stored? Search-Control Rule: Given the goals {at-object(O,Y)} to resolve, effects {at-truck(T,X), at-plane(P,X), airport(Y)}, and distance(X,Y) > 100, prefer the planning decisions {add-step(unload-plane(O,P,Y)), add-step(load-plane(O,P,X)), add-step(fly-plane(P,X,Y))} over the planning decisions {add-step(unload-truck(O,T,Y)), add-step(load-truck(O,T,X)), add-step(drive-truck(T,X,Y))} Rewrite Rule: To-be-replaced actions: {load-truck(O,T,X), drive-truck(T,X,Y), unload-truck(O,T,Y)} Replacing actions: {load-plane(O,P,X), fly-plane(P,X,Y), unload-plane(O,P,Y)}
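Written as plain data, the two rule forms on this slide might look roughly like this (an illustrative representation only):

```python
# Illustrative data structures for the two rule forms described on the slide.
search_control_rule = {
    "if-goals": ["at-object(O, Y)"],
    "if-effects": ["at-truck(T, X)", "at-plane(P, X)", "airport(Y)"],
    "if-test": "distance(X, Y) > 100",
    "prefer": ["add-step(unload-plane(O,P,Y))", "add-step(load-plane(O,P,X))",
               "add-step(fly-plane(P,X,Y))"],
    "over": ["add-step(unload-truck(O,T,Y))", "add-step(load-truck(O,T,X))",
             "add-step(drive-truck(T,X,Y))"],
}
rewrite_rule = {
    "to-be-replaced": ["load-truck(O,T,X)", "drive-truck(T,X,Y)", "unload-truck(O,T,Y)"],
    "replacing": ["load-plane(O,P,X)", "fly-plane(P,X,Y)", "unload-plane(O,P,Y)"],
}
```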

  30. Search Control Knowledge • A heuristic function that provides an estimate of the quality of the plan a node is expected to lead to [Figure: a search tree rooted at root with a node n whose children are annotated with estimated qualities 4, 8, and 2]

  31. Rewrite Rules • A rewrite rule is a 2-tuple ⟨to-be-replaced-subplan, replacing-subplan⟩ • Used after search has produced a complete plan, to rewrite it into a higher quality plan • Only useful in those domains where it is possible to efficiently produce a low quality plan but hard to produce a higher quality one • E.g., to-be-replaced subplan: A4, A5; replacing subplan: B1
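Applying such a rule after search could look like the following sketch, which treats a plan as a simple action sequence (real rewriting must also re-check causal links and orderings):

```python
# Sketch of post-search rewriting: replace the first occurrence of the
# to-be-replaced subplan with the replacing subplan.
def rewrite(plan, rule):
    old, new = rule["to-be-replaced"], rule["replacing"]
    for i in range(len(plan) - len(old) + 1):
        if plan[i:i + len(old)] == old:            # found the subplan to replace
            return plan[:i] + new + plan[i + len(old):]
    return plan                                    # the rule does not apply

plan = ["A1", "A2", "A3", "A4", "A5", "A6"]
rule = {"to-be-replaced": ["A4", "A5"], "replacing": ["B1"]}
print(rewrite(plan, rule))   # ['A1', 'A2', 'A3', 'B1', 'A6']
```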

  32. Planning by Rewriting [Figure: a partially ordered plan over actions A1 through A6, with the subplan A4, A5 rewritten to the single action B1]

  33. Empirical Evaluation I: In What Form Should the Learned Knowledge Be Stored? • Perform empirical experiments to compare the performance of a version of PIP that learns search-control rules (Sys-search-control) with versions that learn rewrite rules (Sys-rewrite-first and Sys-rewrite-best). • Both Sys-rewrite-first and Sys-rewrite-best perform up to two rewritings. • At each rewriting • Sys-rewrite-first randomly chooses one of the applicable rewrite rules • Sys-rewrite-best applies all applicable rewrite rules, trying every way of rewriting the plan.

  34. Experimental Set-up • Three benchmark planning domains: logistics, softbot, and process planning • Randomly generate 120 unique problem instances • Train Sys-search-control and Sys-rewrite on optimal-quality solutions for 20, 30, 40, and 60 examples and test them on the remaining examples (cross-validation) • Plan quality is one minus the average distance of the plans generated by a system from the optimal-quality plans • Planning efficiency is measured by counting the average number of new nodes generated by each system
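The quality metric described on this slide could be computed as in the sketch below; the exact distance measure is not given here, so a relative quality gap is assumed:

```python
# Sketch of the evaluation metric: one minus the average distance of the
# generated plans from the optimal-quality plans (relative gap assumed).
def average_quality_score(system_qualities, optimal_qualities):
    gaps = [(opt - sys) / opt
            for sys, opt in zip(system_qualities, optimal_qualities)]
    return 1.0 - sum(gaps) / len(gaps)

print(average_quality_score([0.8, 0.5], [1.0, 1.0]))  # 0.65
```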

  35. Results [Charts: results for the Softbot, Logistics, and Process Planning domains]

  36. Conclusion I • Both search-control and rewrite rules lead to improvements in plan quality. • Rewrite rules incur a larger cost in planning efficiency than search-control rules • Need a mechanism to distinguish good rules from bad rules and to forget the bad rules • Comparing planning traces seems to be a better technique for learning search-control rules than for learning rewrite rules • Need to explore alternative strategies for learning rewrite rules • By comparing two completed plans of different quality • Through static domain analysis

  37. Empirical Evaluation II: A Study of the Factors Affecting PIP’s Learning Performance • Generated 25 abstract domains varying along a number of seemingly relevant dimensions • Instance similarity • Quality branching factor (average number of alternative solutions of differing quality per problem) • Association between the planner's default bias and the quality bias • Are there any statistically significant differences in PIP’s performance as each factor is varied (Student's t-test)?

  38. Results • PIP’s learning leads to greater improvements in domains where • The quality branching factor is large • The planner’s default biases are negatively correlated with the quality-improving heuristic function • There is no simple relationship between instance similarity and PIP’s learning performance

  39. Conclusion II • Need to address scale-up issues • Need to keep up with advances in AI planning technologies • “It is arguably more difficult to accelerate a new generation planner by outfitting it with learning as the overhead cost by the learning system can overwhelm the gains in search efficiency” (Kambhampati 2001) • The problem is not the lack of a well-defined task! • Organize a symposium/special issue on how to efficiently organize, retrieve, and forget learned knowledge • Open-source planning and learning software?
