
Using Learned Policies in Heuristic-Search Planning


Sungwook Yoon

Computer Science & Engineering

Arizona State University

Tempe, AZ 85281


Alan Fern

Computer Science Department

Oregon State University

Corvallis, OR 97331


Robert Givan

Electrical & Computer Engineering

Purdue University

West Lafayette, IN 47907


Abstract

Many current state-of-the-art planners rely on forward heuristic search. The success of such search typically depends on heuristic distance-to-the-goal estimates derived from the plan graph. Such estimates are effective in guiding search in many domains, but there remain many other domains where current heuristics are inadequate to guide forward search effectively. In some of these domains, it is possible to learn reactive policies from example plans that solve many problems. However, due to the inductive nature of these learning techniques, the policies are often faulty and fail to achieve high success rates. In this work, we consider how to effectively integrate imperfect learned policies with imperfect heuristics in order to improve over each alone. We propose a simple approach that uses the policy to augment the states expanded during each search step. In particular, during each search node expansion, we add not only its neighbors, but all the nodes along the trajectory followed by the policy from the node until some horizon. Empirical results show that our proposed approach benefits both of the leveraged automated techniques, learning and heuristic search, outperforming the state of the art in most benchmark planning domains.

Introduction

Heuristic search has been the most successful and dominant approach for suboptimal AI planning (Hoffmann & Nebel 2001; Gerevini, Saetti, & Serina 2003; Vidal 2004). The success of this approach is largely due to the development of intelligent automated heuristic calculation techniques based on relaxed plans (RPs), plans that ignore the delete lists of action effects. RP-based heuristics provide an effective gradient for search in many AI planning domains. However, there are some domains where this gradient provides inadequate guidance, and heuristic search planners fail to scale up in such domains.

For some of those domains, e.g. Blocksworld, there are automated techniques (Martin & Geffner 2000; Fern, Yoon, & Givan 2004) that can find good policies through machine learning, enabling planners to scale up well. However, induced policies lack deductive guarantees and are in practice prone to be faulty, even more so when resource constraints limit the size of the training data, as occurs in planning competitions. Nevertheless, such policies still capture useful, though imperfect, constraints on good courses of action. The main goal of this paper is to develop and evaluate an approach for combining such imperfect policies and heuristics in order to improve over the performance of either alone.

There are techniques that can improve on imperfect policies. Policy rollout (Bertsekas & Tsitsiklis 1996) and limited discrepancy search (LDS) (Harvey & Ginsberg 1995) are representative of such techniques. Policy rollout uses online simulation to determine, for each state as it is encountered, which action performs best if we take it and then follow the learned base policy to some horizon. Policy rollout performs very poorly if the base policy is too flawed to find any reward, as all actions look equally attractive. This occurs frequently in goal-based domains, where policy rollout cannot improve on a zero-success-ratio policy at all.

Discrepancy search determines a variable search horizon by counting the number of discrepancies from the base policy along the searched path, so that paths that agree with the policy are searched more deeply. The search cost is exponential in the number of discrepancies tolerated, and as a result the search is prohibitively expensive unless the base policy makes mostly acceptable/effective choices.

Due to limitations on the quantity of training data and the lack of any guarantee of an appropriate hypothesis space, machine learning might produce very low quality policies. Such policies are often impossible to improve effectively with either policy rollout or LDS. Here, we suggest using such learned policies during node expansions in heuristic search. Specifically, we propose adding not only the neighbors of the node being expanded, but also all nodes that occur along the trajectory given by the learned policy from the current node. Until the policy makes a bad decision, the nodes added to the search are useful nodes.

In contrast to discrepancy search, this approach leverages the heuristic function heavily. But in contrast to ordinary heuristic search, this approach can ignore the heuristic for long trajectories suggested by the learned policy. This can help the planner critically in escaping severe local minima and large plateaus in the heuristic function. Policy evaluation is typically cheaper than heuristic function calculation and node expansion, so even where the heuristic is working well, this approach can be faster.

We tested our proposed technique over all of the STRIPS/ADL domains of International Planning Competitions (IPC) 3 and 4, except for those domains where the automated heuristic calculation is so effective as to produce essentially the real distance. We used some competition problems for learning policies and then used the rest of the problems for testing our approach. Note that this approach is domain-independent. Empirical results show that our approach performed better than using policies alone and in most domains performed better than state-of-the-art planners.

Planning

A deterministic planning domain D defines a set of possible actions A and a set of states S in terms of a set of predicate symbols P, action types Y, and objects C. Each symbol σ in P or Y has a defined number of arguments it expects, denoted by arity(σ). A state s is a set of state facts, where a state fact is an application of a predicate symbol p to arity(p) objects from C. An action a is an application of an action type y to arity(y) objects from C.

Each action a ∈ A is associated with three sets of state facts, Pre(a), Add(a), and Del(a), representing the precondition, add, and delete effects respectively. As usual, an action a is applicable to a state s iff Pre(a) ⊆ s, and the application of an (applicable) action a to s results in the new state a(s) = (s \ Del(a)) ∪ Add(a).

Given a planning domain, a planning problem is a tuple (s, A, g), where A is a set of available actions (a subset of the domain's actions), s ∈ S is the initial state, and g is a set of state facts representing the goal. A solution plan for a planning problem is a sequence of actions (a1, ..., al) from A, where the sequential application of the sequence starting in state s leads to a goal state s′ in which g ⊆ s′.
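To make these definitions concrete, the following minimal Python sketch encodes states as sets of ground facts and actions as precondition/add/delete lists. The encoding and the class and function names are our own illustration, not part of the paper.

from dataclasses import dataclass, field

# A ground fact is a tuple such as ("on", "B", "A"); a state is a frozenset of facts.
State = frozenset

@dataclass(frozen=True)
class Action:
    name: str
    pre: frozenset = field(default_factory=frozenset)     # Pre(a)
    add: frozenset = field(default_factory=frozenset)     # Add(a)
    delete: frozenset = field(default_factory=frozenset)  # Del(a)

def applicable(a, s):
    # a is applicable in state s iff Pre(a) is a subset of s.
    return a.pre <= s

def apply_action(a, s):
    # a(s) = (s \ Del(a)) union Add(a)
    return State((s - a.delete) | a.add)

def satisfies_goal(s, g):
    # s satisfies goal g iff g is a subset of s.
    return g <= s

Checking a candidate plan then amounts to applying apply_action along the action sequence and testing satisfies_goal in the resulting state.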

Learning Policies

A reactive policy π for a planning domain D is a function that maps each state s to an action a that is applicable to the state s. We desire a policy π for which iterative application of π to the initial state s of a problem p in D will find a goal, so that g ⊆ τπ(τπ(· · · τπ(s))), writing τπ(s) for the next state under π, i.e., (π(s))(s). Good reactive decision-list policies have been represented and learned for many AI planning domains (Khardon 1999; Martin & Geffner 2000; Yoon, Fern, & Givan 2002; 2005).
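In code, iterative policy application is just the following loop, continuing the illustrative sketch above; the step bound is our own addition so that a faulty policy cannot loop forever.

def execute_policy(s, policy, goal, max_steps=1000):
    # Repeatedly move to the next state tau_pi(s) = (pi(s))(s)
    # until the goal holds or we give up.
    trajectory = [s]
    for _ in range(max_steps):
        if satisfies_goal(s, goal):
            return trajectory
        a = policy(s)
        if a is None or not applicable(a, s):
            return None   # the policy makes no applicable suggestion in s
        s = apply_action(a, s)
        trajectory.append(s)
    return None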

Taxonomic Decision List Policies

We represent a reactive policy as an ordered list of rules (Rivest 1987), each of which specifies constraints on the arguments of an action:

DL = {rule_1, ..., rule_n}

rule_i = ⟨y : x_1 ∈ C_i1, ..., x_m ∈ C_im⟩

Here, m is arity(y) for action type y and each C_ij is a concept expression specifying a set of objects, as described below. A rule can fire on any tuple of objects o_1, ..., o_m such that each o_j is in the set specified by the corresponding C_ij; upon firing, the rule suggests the action y(o_1, ..., o_m). The earliest rule in the list that can be fired in the current state will be fired. If more than one action is suggested by that rule, the tie is broken lexicographically based on an arbitrary unchanging ordering of the domain objects.

The syntax for the concepts is the following.

C = any-object | C ∩ C | ¬C | (p C_1 ... C_(i-1) * C_(i+1) ... C_arity(p))

Here, p is a predicate symbol, from P or as specified below. The predicate symbol p is applied to smaller class expressions, but with one argument omitted, indicated by a "*". Such applications denote the class of objects that make the predicate true when filled in for "*", given that the other arguments of p (if any) are provided to match the class expressions given. The semantics of the other constructs as classes of objects is as expected; please refer to (Yoon, Fern, & Givan 2002; 2006; McAllester & Givan 1993) for the detailed specification of the syntax and semantics. Note that this syntax automatically derives from the predicate symbols of the target planning domain.

The evaluation of the semantics of each class C is relative to the relational database D constructed from the current state, the goal information, and other information that is deduced from the current state and goal information. In our case this includes the relaxed plan from the current state to a goal state, a form of reasoning used to construct features (Geffner 2004), and the reflexive transitive closure p* for each binary predicate p. Predicate symbols other than those in P are allowed in order to represent goal information and features of the relaxed plan, as described in detail in (Yoon, Fern, & Givan 2002; 2006).

As an example decision-list policy, consider a simple Blocksworld problem where the goal is clearing off a block A. The following decision-list policy is an optimal policy: { ⟨putdown(x1) : x1 ∈ holding(*)⟩, ⟨unstack(x1) : x1 ∈ clear(*) ∩ (on* * A)⟩ }. The first rule says "put down any block that is being held" and the second rule says "unstack any block that is above the block A and clear".
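Purely as an illustration (the policies in this paper are learned in the taxonomic syntax above, not hand-coded), the two rules could be evaluated over the fact-set states of the earlier Python sketch roughly as follows:

def objects_of(pred, s):
    # pred(*): the objects that make the unary predicate pred true in state s.
    return {f[1] for f in s if f[0] == pred}

def above(block, s):
    # (on* * A): the blocks transitively above `block` via the on relation.
    on = {(f[1], f[2]) for f in s if f[0] == "on"}
    result, frontier = set(), {block}
    while frontier:
        frontier = {x for (x, y) in on if y in frontier} - result
        result |= frontier
    return result

def clear_block_policy(s, goal_block="A"):
    # Rule 1: putdown(x1) with x1 in holding(*).
    held = objects_of("holding", s)
    if held:
        return ("putdown", min(held))                 # lexicographic tie-breaking
    # Rule 2: unstack(x1) with x1 in clear(*) intersected with (on* * A).
    candidates = objects_of("clear", s) & above(goal_block, s)
    if candidates:
        return ("unstack", min(candidates))
    return None                                       # no rule fires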

Learning Reactive Policies

For our evaluation, we learn reactive policies from solved small problems. For this purpose we deploy a learner similar to that presented in (Yoon, Fern, & Givan 2002). Given a set of training problems along with solutions, we create a training set to learn a classifier that will be taken to be the learned policy. The classifier training set contains states labeled by positive and negative actions. The positive actions are those that are contained in the solution plan for each state, and all other actions in a state are taken to be negative. The learning algorithm conducts a beam search through the candidate class expressions, greedily seeking constraints on each argument of each action type that match the positive actions and not the negative actions. Note that the training set is noisy in the sense that not all actions labeled as negative are actually bad. That is, there can be many optimal/good actions from a single state, though only one is included in the training solutions. This noise is one reason that learned policies can be imperfect.
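The construction of the classifier training set can be sketched as follows. This is our own simplified rendering (the data format and helper names are assumptions), and it omits the beam search over class expressions that subsequently fits rules to these examples.

def build_classifier_training_set(training_problems):
    # Each training problem contributes the state sequence visited by its solution
    # plan, the plan itself, and a generator of applicable actions per state.
    examples = []
    for states, plan, applicable_actions in training_problems:
        for s, chosen in zip(states, plan):
            examples.append((s, chosen, +1))        # action used by the solution: positive
            for a in applicable_actions(s):
                if a != chosen:
                    examples.append((s, a, -1))     # every other applicable action: negative
    return examples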

Using Non-Optimal Policies

Typically, learned policies will not be perfect, due to the bias of the policy language and the variance of the learning procedure. In some domains, the imperfections are not catastrophic and the policies still obtain high success rates. In other domains, the flaws lead to extremely poor success rates. Nevertheless, in many states, even learned policies with poor success rates suggest good actions, and we would like methods that can exploit this information effectively. First, we describe two existing techniques, policy rollout and discrepancy search, that utilize search to improve upon an imperfect policy. Our experiments will show that these techniques do not work well in many planning domains, typically where the quality of the learned policy is low. These failures led us to propose a new approach to improving policies with search, which is discussed at the end of this section.

Policy Rollout

Figure 1: Policy Rollout. At a state s, for each action a, roll out the input policy π from state a(s) until some horizon. For deterministic domains, the length of the rollout trajectory plus the heuristic at the end can be used as the Q-value for that action. Select the best action.

Policy rollout (Bertsekas & Tsitsiklis 1996) is a technique for improving the performance of a non-optimal "base" policy using simulation. Policy rollout sequentially selects the action that is the best according to the one-step lookahead policy evaluations of the base policy. Figure 1 shows the one-step lookahead action selection. For each action, this procedure simulates the action from the current state and then simulates the execution of the base policy from the resulting state for some horizon or until the goal is found. Each action is assigned a cost equal to the length of the trajectory following it plus the value of a heuristic applied at the final state (which is zero for goal states). For stochastic domains the rollout should be tried several times and the average over the rollout trials is used as a Q-value estimate. Policy rollout then executes the action that achieved the smallest cost from the current state and repeats the entire process from the resulting state. While policy rollout can improve a policy that is mostly optimal, it can perform poorly when the policy commits many errors, leading to inaccurate action costs. Multi-level policy rollout, e.g. as used in (Yan et al. 2004), is one way to improve over the above procedure by recursively applying rollout, which takes time exponential in the number of recursive applications. This approach can work well when the policy errors are typically restricted to the initial steps of trajectories. However, for policies learned in planning domains the distribution of errors does not typically have this form; thus, even multi-step rollout is often ineffective at improving weak policies.
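In the deterministic, goal-based setting used here, the Figure 1 selection rule can be sketched as follows, again building on the earlier illustrative helpers; the heuristic h and the generator of applicable actions are assumed to be supplied.

def rollout_cost(s, base_policy, h, goal, horizon):
    # Follow the base policy from s for at most `horizon` steps; the cost is the
    # number of steps taken plus the heuristic at the final state (0 at a goal).
    steps = 0
    while steps < horizon and not satisfies_goal(s, goal):
        a = base_policy(s)
        if a is None or not applicable(a, s):
            break
        s = apply_action(a, s)
        steps += 1
    return steps + (0 if satisfies_goal(s, goal) else h(s))

def rollout_action(s, actions, base_policy, h, goal, horizon=1000):
    # Q(s, a) is estimated as 1 + rollout cost from a(s); choose the cheapest action.
    best_action, best_cost = None, float("inf")
    for a in actions(s):
        cost = 1 + rollout_cost(apply_action(a, s), base_policy, h, goal, horizon)
        if cost < best_cost:
            best_action, best_cost = a, cost
    return best_action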

Discrepancy Search

A discrepancy in a search is a search step that is not selected by the input policy or heuristic function. Limited discrepancy search (LDS) (Harvey & Ginsberg 1995) bounds the search by some given number of discrepancies. Discrepancy search limited to n discrepancies will consider every search path that deviates from the policy at most n times.

Figure 2: An Example of Discrepancy Search. Thin edges in the figure represent discrepancies. Nodes are labeled with their discrepancy depth.

Figure 2 shows an example of discrepancy search. The thick lines are choices favored by the heuristic function and the thin lines are discrepancies. Each node is shown as a rectangle labeled by the number of discrepancies needed to reach that node. Consider the depth of the goal in the search tree and the number of discrepancies needed to reach the goal node from the root node. In Figure 2, the goal node is at depth three and discrepancy depth one from the root node. Plain DFS or BFS search will search more nodes than discrepancy search with limit one. DFS and BFS need to visit 14 nodes before reaching the goal, while discrepancy search with limit one will find the goal after a 9-node search. Greedy heuristic search on this tree needs to backtrack many times before it reaches the goal node. The search space of LDS has size exponential in the discrepancy bound, and so, like policy rollout, LDS is only effective with policies or heuristic functions of high quality. Heuristics can be incorporated by applying the heuristic at the leaves of the discrepancy search tree and then selecting the action that leads to the best heuristic value.
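A simplified depth-bounded variant of this idea can be sketched as follows; it enforces only the discrepancy budget and ignores the particular visiting order of Harvey and Ginsberg's algorithm (the helper names are the illustrative ones used earlier).

def lds_plan(s, base_policy, actions, goal, discrepancies, depth_limit):
    # Depth-first search that tolerates at most `discrepancies` deviations
    # from the base policy along any root-to-leaf path.
    if satisfies_goal(s, goal):
        return []
    if depth_limit == 0:
        return None
    preferred = base_policy(s)
    for a in actions(s):
        remaining = discrepancies - (0 if a == preferred else 1)
        if remaining < 0:
            continue
        rest = lds_plan(apply_action(a, s), base_policy, actions, goal,
                        remaining, depth_limit - 1)
        if rest is not None:
            return [a] + rest
    return None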

Incorporating Policies into Heuristic Search

In this work, we consider an alternative approach for incorporating imperfect policies into search. We attempt to take advantage of both approaches, automated heuristic calculation and automated policy learning. The main idea is to use policies during node expansions in best-first heuristic search, as described by the node expansion function shown in Figure 3. At each node expansion of the best-first search, we add to the search queue the successors of the current best node as usual, but also add the states encountered by following the policy from the current best node for some horizon. Our approach is similar to Marvin (Coles & Smith 2004), Macro-FF (Botea et al. 2004), and YAHSP (Vidal 2004). However, unlike these approaches, we do not just add the final state encountered by the policy or macro, but rather add all states along the trajectory.

Embedding the policy into node expansion during heuristic search yields two primary advantages over pure heuristic search. First, when the input policy correlates well with the heuristic values, the embedding can reduce the search time. Typically, direct policy calculation is much faster than greedy heuristic search, because greedy heuristic search needs to compute the heuristic value of every neighbor, while direct policy execution considers only the current state in selecting actions. Second, in domains like Blocksworld, where the heuristic calculation frequently underestimates the true distance, our node expansion can lead the heuristic search out of local minima.

Node-Expansion-with-Policy(s, π, H)

// a state s, a policy π, a horizon H

N ← Neighbors(s)

s′ ← s

for i = 1 to H

  s′ ← π(s′)   // the state reached by applying the action chosen by π

  N ← N ∪ {s′}

Return N

Figure 3: Node expansion for a heuristic search with a policy. Add the nodes that occur along the trajectory of the input policy as well as the neighbors.
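A sketch of how this expansion could sit inside a greedy best-first search loop follows; it is our own simplified reconstruction in the illustrative Python setting used earlier, and it omits the parent pointers a real planner would keep to recover the plan.

import heapq
import itertools

def node_expansion_with_policy(s, policy, neighbors, horizon):
    # Figure 3: the usual neighbors of s plus every state on the policy trajectory from s.
    expanded = list(neighbors(s))
    current = s
    for _ in range(horizon):
        a = policy(current)
        if a is None or not applicable(a, current):
            break
        current = apply_action(a, current)
        expanded.append(current)
    return expanded

def best_first_search(s0, policy, neighbors, h, goal, horizon=1000):
    tie = itertools.count()              # keeps heapq from ever comparing states
    queue = [(h(s0), next(tie), s0)]
    visited = {s0}
    while queue:
        _, _, s = heapq.heappop(queue)
        if satisfies_goal(s, goal):
            return s
        for child in node_expansion_with_policy(s, policy, neighbors, horizon):
            if child not in visited:
                visited.add(child)
                heapq.heappush(queue, (h(child), next(tie), child))
    return None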

Experiments

We evaluated the above approaches on the STRIPS domains from the recent international planning competitions IPC3 and IPC4, and on Blocksworld. We show the performance of each planning system in each row of the following figures. To test the effectiveness of our approach, we tested our base system, LEN, as well as our proposed technique, PH. The LEN system uses best-first search with the relaxed-plan-length (RPL) heuristic. We also compare against the planner FF and, for each IPC4 domain, against the best planner in the competition on that particular domain. Finally, we compare against the policy rollout (PR) and limited discrepancy search (D) techniques. For each system and domain, we report the number of solved problems, the average solution time for the solved problems, and the average solution length for the solved problems in separate columns of the results.

We used the first 15 problems as training data in each domain and the remaining problems for testing. We considered a problem to be unsolved if it was not solved within 30 minutes. For the PR, D, and PH systems, we used a horizon of 1000. All the experiments were run on Linux machines with a 2.8 GHz Xeon processor and 2 GB of RAM.

Blocksworld

Figure 4 shows the results on Blocksworld from Track 1 of IPC2. In this domain, LEN, PH, and D solved all the problems, but FF and PR failed to solve some problems. We used 100 CPU seconds in learning the policy. The learned policy, π_Blocksworld, solved 13 problems. Thus, all of the policy improvement approaches, PR, D, and PH, improved on the input policy. The PH system solved all of the problems and improved the solution time over LEN, PR, and D, showing the benefit of our approach for speeding up planning, though PH produced slightly longer plans than FF and D.

Blocksworld (IPC2)

Systems   Solved (20) ↑   Time ↓   Length ↓
FF        16              0.64     38.1
LEN       20              11.74    116.3
PR        19              7.89     37.8
D         20              2.09     38
PH        20              0.06     44

Figure 4: Blocksworld Results

IPC3

Figure 5 summarizes the results on the IPC3 domains. These domains are significantly more difficult to learn policies for, and we used 8h, 6h, and 6h of CPU time for learning the policies in Depots, Driverlog, and FreeCell respectively. Although the learning times are substantial, this is a one-time cost: such learned policies can be reused for any problem instance from the domain involved. In these domains, the learned policies cannot solve any of the tested problems by themselves.

We see that all of PR, D, and PH are able to improve on the input policies. PH solved more problems overall than FF, though FF has a slight advantage in solution length in the Depots domain. The Freecell domain has deadlock states and our policy performs poorly, probably by leading too often to deadlock states. Still, the experiments show that our approach attained better solution times, demonstrating again that using policies in search can reduce the search time by quickly finding low-heuristic states.

PH outperforms LEN decisively in Depots and Driverlog, showing a clear benefit from using imperfect policies in heuristic search for these domains.

IPC4

Figure 6 shows the results for the IPC4 domains. We used 18, 35, 1, 3, and 4 hours of CPU time for learning the policies for Pipesworld, Pipesworld-Tankage, PSR, Philosophers, and Optical Telegraph respectively.

IPC3

Domain      Systems   Solved (5) ↑   Time ↓    Length ↓
Depots      FF        5              4.32      54.2
            LEN       1              0.28      29.0
            PR        2              70.86     32
            D         3              34.23     36.5
            PH        5              3.73      63.2
Driverlog   FF        1              1105      167.0
            LEN       1              1623      167.0
            PR        0              -         -
            D         0              -         -
            PH        4              75.91     177.3
Freecell    FF        5              574.84    108.4
            LEN       5              442.94    109.0
            PR        3              545.11    111.3
            D         3              323.23    85.6
            PH        4              217.49    108.3

Figure 5: IPC3 results

Here we evaluated one more system, B, which in each domain denotes the best performer on that domain from IPC4. The numbers for B were downloaded from the IPC web site and cannot be directly compared to our systems' results. Still, the numbers give some idea of the performance comparison. Note that the learned policies alone can solve all the problems from the Philosophers and Optical Telegraph domains.

PH generally outperforms FF in solved problems, rendering solution length measures incomparable, except in PSR, where the two approaches essentially tie.

For the Pipesworld and Pipesworld-with-Tankage domains, the best performers of those domains in the competition performed much better than our system PH. However, PH still performed better than LEN, FF, PR, and D, showing the benefits of our approach. Marvin, Macro-FF, and YAHSP participated in the competition. Each of these macro-action-based planners solved 60 or so problems. The best planners for each domain combined solved 105 problems. PH solved 133 problems, showing a clear advantage for our technique. Overall, the results show that our system is typically able to effectively combine the learned imperfect policies with the relaxed-plan heuristic in order to improve over each alone. This is a novel result. Neither policy rollout nor discrepancy search is effective in this regard, and we are not aware of any prior work that has successfully integrated imperfect policies into heuristic search planners.

Heuristic Value Trace in Search

In order to get a view of how incorporating policies can improve the heuristic search, we plotted a trace of the heuristic value for each node expansion during the search for problem 20 of Depots. Figure 7 shows a long plateau of heuristic values for the LEN system, while Figure 8 shows large jumps in heuristic value for the PH system, using far fewer node expansions, indicating that the policies are effectively helping to escape plateaus in the search space.

IPC4

Domain        Systems   Solved (35) ↑   Time ↓    Length ↓
Pipesworld    FF        21              71.20     48.2
              LEN       15              71.38     48.2
              B         35              4.94      74.6
              PR        18              472.79    59.5
              D         14              215.51    49.3
              PH        29              129.21    76.7
Pipesworld    FF        4               532.20    62.0
Tankage       LEN       4               333.34    62.0
              B         28              221.96    165.6
              PR        7               366.11    49.7
              D         5               428.71    66.4
              PH        21              124.07    86.3
PSR           FF        17              638.42    104.2
(middle       LEN       22              646.59    107.2
compiled)     B         18              124.93    106.8
              PR        6               691.04    72.0
              D         4               225.56    55.5
              PH        17              659.37    107.0
Philosophers  FF        0               -         -
              LEN       0               -         -
              B         14              0.20      258.5
              PR        33              3.89      363.0
              D         33              2.59      363.0
              PH        33              2.59      363.0
Optical       FF        0               -         -
Telegraph     LEN       0               -         -
              B         10              721.13    387.0
              PR        33              23.77     594.0
              D         33              19.67     594.0
              PH        33              19.67     594.0

Figure 6: IPC4 results

In Figures 10 and 9, we show the heuristic traces for LEN and PH on a Freecell problem where PH failed. Here PH is unable to escape from the plateau, making only small jumps in heuristic value over many node expansions. Interestingly, the trace of the LEN system shows a plateau at a higher heuristic value than that for PH, followed by a rapid decrease in heuristic value upon which the problem is solved. We are currently investigating the reasons for this behavior. At this point, we speculate that the learned policy manages to lead the planner away from a useful exit.

Conclusion

We embedded learned policies in heuristic search by following the policy during node expansion to generate a trajectory of new child nodes. Our empirical study indicates advantages to this technique, which we conjecture to have three sources. First, when the policy correlates well with the heuristic function, the embedding can speed up the search. Second, the embedding can help escape local minima during the search. Finally, the heuristic search can repair faulty action choices in the input policy. Our approach is easy to implement and effective, and to our knowledge represents the first demonstration of effective incorporation of imperfect policies into heuristic search planning.

References

Bertsekas, D. P., and Tsitsiklis, J. N. 1996. Neuro-Dynamic Programming. Athena Scientific.

Botea, A.; Enzenberger, M.; Müller, M.; and Schaeffer, J. 2004. Macro-FF. In 4th International Planning Competition.

Coles, A. I., and Smith, A. J. 2004. Marvin: Macro-actions from reduced versions of the instance. IPC4 Booklet, ICAPS 2004. Extended abstract.

Fern, A.; Yoon, S.; and Givan, R. 2004. Learning domain-specific control knowledge from random walks. In ICAPS.

Geffner, H. 2004. Planning graphs and knowledge compilation. In International Conference on Principles of Knowledge Representation and Reasoning.

Gerevini, A.; Saetti, A.; and Serina, I. 2003. Planning through stochastic local search and temporal action graphs in LPG. Journal of Artificial Intelligence Research 20:239-290.

Harvey, W. D., and Ginsberg, M. L. 1995. Limited discrepancy search. In Mellish, C. S., ed., Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence (IJCAI-95), volume 1, 607-615. Montréal, Québec, Canada: Morgan Kaufmann.

Hoffmann, J., and Nebel, B. 2001. The FF planning system: Fast plan generation through heuristic search. JAIR 14:263-302.

Khardon, R. 1999. Learning action strategies for planning domains. AIJ 113(1-2):125-148.

Martin, M., and Geffner, H. 2000. Learning generalized policies in planning domains using concept languages. In KRR.

McAllester, D., and Givan, R. 1993. Taxonomic syntax for first-order inference. Journal of the ACM 40:246-283.

Rivest, R. 1987. Learning decision lists. MLJ 2(3):229-246.

Vidal, V. 2004. A lookahead strategy for heuristic search planning. In International Conference on Automated Planning and Scheduling.

Yan, X.; Diaconis, P.; Rusmevichientong, P.; and Van Roy, B. 2004. Solitaire: Man versus machine. In NIPS.

Yoon, S.; Fern, A.; and Givan, R. 2002. Inductive policy selection for first-order MDPs. In UAI.

Yoon, S.; Fern, A.; and Givan, R. 2005. Learning measures of progress for planning domains. In AAAI.

Yoon, S.; Fern, A.; and Givan, R. 2006. Learning heuristic functions from relaxed plans. In ICAPS.
